High-confidence error estimates for learned value functions
Estimating the value function for a fixed policy is a fundamental problem in
reinforcement learning. New policy evaluation algorithms continue to be
developed to improve convergence rates, improve stability, and handle
variability, particularly for off-policy learning. To
understand the properties of these algorithms, the experimenter needs
high-confidence estimates of the accuracy of the learned value functions. For
environments with small, finite state-spaces, such as chains, the true value
function can be computed exactly, making accuracy easy to measure. For large or
continuous state-spaces, however, this is no longer feasible. In this paper, we address
the largely open problem of how to obtain these high-confidence estimates, for
general state-spaces. We provide a high-confidence bound on the deviation of an
empirical estimate of the value error from the true value error. We use this bound to
design an offline sampling algorithm, which stores the required quantities to
repeatedly compute value error estimates for any learned value function. We
provide experiments investigating the number of samples required by this
offline algorithm in simple benchmark reinforcement learning domains, and
highlight that there are still many open questions to be solved for this
important problem.
Comment: Presented at Uncertainty in Artificial Intelligence (UAI) 201
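As a toy illustration of the quantity this abstract is about, the following sketch estimates the value error of a learned value function by Monte Carlo on a small chain, with a Hoeffding-style half-width. All names, the environment, and the crude range bound are illustrative assumptions; in particular, comparing against single sampled returns inflates the estimate by the return variance, and handling that gap rigorously is exactly what the paper's bound addresses.

```python
import math
import random

def rollout_return(start, gamma=0.9, horizon=50, rng=random):
    """Monte Carlo return of a uniform random-walk policy on a 5-state chain
    (reward 1 on reaching the rightmost state, then the episode ends)."""
    s, g, disc = start, 0.0, 1.0
    for _ in range(horizon):
        s = max(0, min(4, s + rng.choice([-1, 1])))
        g += disc * (1.0 if s == 4 else 0.0)
        disc *= gamma
        if s == 4:
            break
    return g

def empirical_value_error(v_hat, n_samples=500, delta=0.05, seed=0):
    """Empirical squared value error of a learned value function v_hat over a
    uniform state distribution, plus a Hoeffding-style confidence half-width.
    NOTE: a sketch, not the paper's offline sampling algorithm or bound."""
    rng = random.Random(seed)
    errs = []
    for _ in range(n_samples):
        s = rng.randrange(5)
        errs.append((v_hat[s] - rollout_return(s, rng=rng)) ** 2)
    mean = sum(errs) / n_samples
    b = max(errs)  # crude range bound for the Hoeffding term
    half_width = b * math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))
    return mean, half_width
```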
Adaptively Truncating Backpropagation Through Time to Control Gradient Bias
Truncated backpropagation through time (TBPTT) is a popular method for
learning in recurrent neural networks (RNNs) that saves computation and memory
at the cost of bias by truncating backpropagation after a fixed number of lags.
In practice, choosing the optimal truncation length is difficult: TBPTT will
not converge if the truncation length is too small, or will converge slowly if
it is too large. We propose an adaptive TBPTT scheme that converts the problem
from choosing a temporal lag to one of choosing a tolerable amount of gradient
bias. For many realistic RNNs, the TBPTT gradients decay geometrically in
expectation for large lags; under this condition, we can control the bias by
varying the truncation length adaptively. For RNNs with smooth activation
functions, we prove that this bias controls the convergence rate of SGD with
biased gradients for our non-convex loss. Using this theory, we develop a
practical method for adaptively estimating the truncation length during
training. We evaluate our adaptive TBPTT method on synthetic data and language
modeling tasks and find that our adaptive TBPTT ameliorates the computational
pitfalls of fixed TBPTT.
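The core idea of this abstract, trading a fixed truncation length for a bias tolerance under geometric gradient decay, can be sketched as follows. The decision rule and names are illustrative; the paper's estimator is developed adaptively during training.

```python
def choose_truncation(grad_norms, bias_tolerance=0.05):
    """Pick the smallest truncation length K such that the estimated tail of
    the gradient beyond K is at most `bias_tolerance` of the total.

    grad_norms[k] is the norm of the gradient contribution at lag k, assumed
    (approximately) geometrically decaying for large k, as in the abstract."""
    total = sum(grad_norms)
    if total == 0:
        return 1
    tail = total
    for k, g in enumerate(grad_norms, start=1):
        tail -= g
        if tail <= bias_tolerance * total:
            return k
    return len(grad_norms)
```

With geometrically decaying contributions 0.5, 0.25, 0.125, ..., a 5% bias tolerance yields a truncation length of 5 lags.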
A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation
Temporal difference learning (TD) is a simple iterative algorithm used to
estimate the value function corresponding to a given policy in a Markov
decision process. Although TD is one of the most widely used algorithms in
reinforcement learning, its theoretical analysis has proved challenging and few
guarantees on its statistical efficiency are available. In this work, we
provide a simple and explicit finite time analysis of temporal difference
learning with linear function approximation. Except for a few key insights, our
analysis mirrors standard techniques for analyzing stochastic gradient descent
algorithms, and therefore inherits the simplicity and elegance of that
literature. Final sections of the paper show how all of our main results extend
to the study of TD learning with eligibility traces, known as TD(λ), and to
Q-learning applied in high-dimensional optimal stopping problems.
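The algorithm this analysis concerns is the standard TD(0) update with linear function approximation, theta ← theta + α(r + γ·φ(s')ᵀθ − φ(s)ᵀθ)·φ(s). A minimal runnable sketch on a 5-state random walk (the environment and one-hot features are illustrative choices, not from the paper):

```python
import random

def phi(s, n_states=5):
    """One-hot state features; any linear feature map works the same way."""
    f = [0.0] * n_states
    f[s] = 1.0
    return f

def td0_linear(n_episodes=5000, alpha=0.1, gamma=0.9, seed=0):
    """TD(0) with linear function approximation on a 5-state random walk.
    Episodes terminate off either end; reward 1 for exiting on the right."""
    rng = random.Random(seed)
    theta = [0.0] * 5
    for _ in range(n_episodes):
        s = 2  # start in the middle
        while True:
            s2 = s + rng.choice([-1, 1])
            done = s2 < 0 or s2 > 4
            r = 1.0 if s2 > 4 else 0.0
            v_s = sum(t * f for t, f in zip(theta, phi(s)))
            v_s2 = 0.0 if done else sum(t * f for t, f in zip(theta, phi(s2)))
            delta = r + gamma * v_s2 - v_s  # TD error
            theta = [t + alpha * delta * f for t, f in zip(theta, phi(s))]
            if done:
                break
            s = s2
    return theta
```

The learned weights increase toward the rewarding right end of the chain, as the true value function does.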
Classification-based Approximate Policy Iteration: Experiments and Extended Discussions
Tackling large approximate dynamic programming or reinforcement learning
problems requires methods that can exploit regularities, or intrinsic
structure, of the problem at hand. Most current methods are geared towards
exploiting the regularities of either the value function or the policy. We
introduce a general classification-based approximate policy iteration (CAPI)
framework, which encompasses a large class of algorithms that can exploit
regularities of both the value function and the policy space, depending on what
is advantageous. This framework has two main components: a generic value
function estimator and a classifier that learns a policy based on the estimated
value function. We establish theoretical guarantees for the sample complexity
of CAPI-style algorithms, which allow the policy evaluation step to be
performed by a wide variety of algorithms (including temporal-difference-style
methods), and can handle nonparametric representations of policies. Our bounds
on the estimation error of the performance loss are tighter than existing
results. We also illustrate this approach empirically on several problems,
including a large HIV control task.
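A minimal instance of the CAPI recipe described above, a value estimator labeling sampled states with greedy actions and a classifier fitting those labels, might look like this. The 1-D decision-stump "classifier" and the toy Q-function in the usage below are illustrative assumptions standing in for the framework's generic components.

```python
def capi_iteration(q_estimate, states, actions):
    """One classification-based policy iteration step: label each sampled
    state with its greedy action under the current value estimate, then fit
    a classifier (here, a 1-D threshold stump) to those labels."""
    labels = [(s, max(actions, key=lambda a: q_estimate(s, a))) for s in states]
    # Fit the stump: choose threshold t minimizing misclassification of the
    # rule "take 'right' iff s >= t".
    best_t, best_err = None, float("inf")
    for t in states:
        err = sum((s >= t) != (a == "right") for s, a in labels)
        if err < best_err:
            best_t, best_err = t, err
    return lambda s: "right" if s >= best_t else "left"
```

For example, with a toy Q-estimate that favors "right" above state 5, the fitted stump reproduces the greedy policy.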
A Self-Adaptive Proposal Model for Temporal Action Detection based on Reinforcement Learning
Existing action detection algorithms usually generate action proposals
through an extensive search over the video at multiple temporal scales, which
brings about huge computational overhead and deviates from the human perception
procedure. We argue that the process of detecting actions should be naturally
one of observation and refinement: observe the current window and refine the
span of attended window to cover true action regions. In this paper, we propose
an active action proposal model that learns to find actions through
continuously adjusting the temporal bounds in a self-adaptive way. The whole
process can be viewed as an agent that is first placed at a random position in
the video and then applies a sequence of transformations to the current
attended region to discover actions according to a learned policy. We utilize
reinforcement learning, specifically the deep Q-learning algorithm, to learn the
agent's decision policy. In addition, we use temporal pooling operation to
extract more effective feature representation for the long temporal window, and
design a regression network to adjust the position offsets between predicted
results and the ground truth. Experimental results on THUMOS 2014 validate the
effectiveness of the proposed approach, which achieves competitive performance
with current action detection algorithms using far fewer proposals.
Comment: Deep Reinforcement Learning, Temporal Action Detection, Temporal
Location Regression
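A tabular toy stand-in for the Q-learning setup described above, an agent shifting a fixed-length window to maximize temporal overlap (IoU) with a target action span, could look like the sketch below. The discretized timeline, action set, and hyperparameters are all illustrative; the paper's agent operates on deep video features.

```python
import random

def iou(a, b):
    """Temporal IoU of two intervals given as (start, end) tuples."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def train_window_agent(target=(30, 50), episodes=3000, seed=0):
    """Tabular Q-learning for an agent that shifts a length-20 window along
    a discretized timeline; stopping yields the IoU with the target span."""
    rng = random.Random(seed)
    actions = ["left", "right", "stop"]
    q = {}

    def qv(s, a):
        return q.get((s, a), 0.0)

    for _ in range(episodes):
        s = rng.randrange(0, 17) * 5  # window start in {0, 5, ..., 80}
        for _ in range(30):  # cap episode length
            explore = rng.random() < 0.2
            a = rng.choice(actions) if explore else \
                max(actions, key=lambda x: qv(s, x))
            if a == "stop":
                r, s2, done = iou((s, s + 20), target), s, True
            else:
                s2 = max(0, min(80, s + (5 if a == "right" else -5)))
                r, done = 0.0, False
            best_next = 0.0 if done else max(qv(s2, x) for x in actions)
            q[(s, a)] = qv(s, a) + 0.3 * (r + 0.9 * best_next - qv(s, a))
            if done:
                break
            s = s2
    return q
```

After training, stopping at the perfectly aligned window (start 30) carries a high Q-value, since its stop reward is IoU = 1.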
Sequential anatomy localization in fetal echocardiography videos
Fetal heart motion is an important diagnostic indicator for structural
detection and functional assessment of congenital heart disease. We propose an
approach towards integrating deep convolutional and recurrent architectures
that utilize localized spatial and temporal features of different anatomical
substructures within a global spatiotemporal context for interpretation of
fetal echocardiography videos. We formulate our task as a cardiac structure
localization problem with convolutional architectures for aggregating global
spatial context and detecting anatomical structures on spatial region
proposals. This information is aggregated temporally by recurrent architectures
to quantify the progressive motion patterns. We experimentally show that the
resulting architecture combines anatomical landmark detection at the
frame-level over multiple video sequences with the temporal progress of the
associated anatomical motions to encode local spatiotemporal fetal heart
dynamics and is validated on a real-world clinical dataset.
Comment: To appear in ISBI 201
Online Monotone Games
Algorithmic game theory (AGT) focuses on the design and analysis of
algorithms for interacting agents, with interactions rigorously formalized
within the framework of games. Results from AGT find applications in domains
such as online bidding auctions for web advertisements and network routing
protocols. Monotone games are games where agent strategies naturally converge
to an equilibrium state. Previous results in AGT have been obtained for convex,
socially-convex, or smooth games, but not monotone games. Our primary
theoretical contributions are defining the monotone game setting and its
extension to the online setting, a new notion of regret for this setting, and
accompanying algorithms that achieve sub-linear regret. We demonstrate the
utility of online monotone game theory on a variety of problem domains
including variational inequalities, reinforcement learning, and generative
adversarial networks.
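To make the notion of sub-linear regret concrete, here is textbook online gradient descent with step size η_t ∝ 1/√t, which attains O(√T) regret on bounded convex losses. This is a generic illustration of the regret framework, not the paper's monotone-game algorithm.

```python
import math

def online_gradient_descent(grads, project, x0, radius=1.0):
    """Online gradient descent: at round t, play x_t, observe the gradient
    of the round's loss, and step with eta_t = radius / sqrt(t). Returns the
    sequence of iterates played."""
    x, xs = x0, []
    for t, grad in enumerate(grads, start=1):
        xs.append(x)
        eta = radius / math.sqrt(t)
        x = project(x - eta * grad(x))
    return xs
```

For a fixed quadratic loss (x − 0.5)² on [0, 1], the iterates settle at the minimizer, so the average regret vanishes.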
End-to-end Learning of Action Detection from Frame Glimpses in Videos
In this work we introduce a fully end-to-end approach for action detection in
videos that learns to directly predict the temporal bounds of actions. Our
intuition is that the process of detecting actions is naturally one of
observation and refinement: observing moments in video, and refining hypotheses
about when an action is occurring. Based on this insight, we formulate our
model as a recurrent neural network-based agent that interacts with a video
over time. The agent observes video frames and decides both where to look next
and when to emit a prediction. Since backpropagation is not adequate in this
non-differentiable setting, we use REINFORCE to learn the agent's decision
policy. Our model achieves state-of-the-art results on the THUMOS'14 and
ActivityNet datasets while observing only a fraction (2% or less) of the video
frames.
Comment: Update to version in CVPR 2016 proceedings
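The REINFORCE estimator used to train the agent can be illustrated on a two-armed bandit: weight the score-function gradient ∇log π(a) by the observed reward. The softmax policy and reward values below are toy assumptions, not the paper's video model.

```python
import math
import random

def reinforce_bandit(n_steps=2000, lr=0.1, seed=0):
    """Minimal REINFORCE: learn a softmax policy over two actions using
    reward-weighted score-function gradients, grad log pi(a) * reward."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per action
    for _ in range(n_steps):
        m = max(theta)
        exps = [math.exp(t - m) for t in theta]  # stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.2  # action 1 is better (toy rewards)
        # d log pi(a) / d theta_k = 1{k == a} - probs[k]
        for k in range(2):
            theta[k] += lr * r * ((1.0 if k == a else 0.0) - probs[k])
    return theta
```

The logit of the higher-reward action grows, so the policy concentrates on it.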
Nonparametric risk bounds for time-series forecasting
We derive generalization error bounds for traditional time-series forecasting
models. Our results hold for many standard forecasting tools including
autoregressive models, moving average models, and, more generally, linear
state-space models. These non-asymptotic bounds need only weak assumptions on
the data-generating process, yet allow forecasters to select among competing
models and to guarantee, with high probability, that their chosen model will
perform well. We motivate our techniques with and apply them to standard
economic and financial forecasting tools: a GARCH model for predicting equity
volatility and a dynamic stochastic general equilibrium (DSGE) model, the
standard tool in macroeconomic forecasting. We demonstrate in particular how
our techniques can aid forecasters and policy makers in choosing models which
behave well under uncertainty and mis-specification.
Comment: 34 pages, 3 figures
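The model-selection use of such bounds can be sketched in miniature: fit a forecasting model by least squares, then compare models by empirical risk plus a complexity term that shrinks like √(1/n). The AR(1) fit and the specific penalty below are illustrative assumptions, not the paper's bound.

```python
import math
import random

def simulate_ar1(a=0.8, n=500, noise=0.1, seed=0):
    """Generate a synthetic AR(1) series x_t = a * x_{t-1} + Gaussian noise."""
    rng = random.Random(seed)
    x = [1.0]
    for _ in range(n):
        x.append(a * x[-1] + noise * rng.gauss(0, 1))
    return x

def fit_ar1(series):
    """Least-squares AR(1) coefficient for the model x_t ~ a * x_{t-1}."""
    num = sum(y * x for y, x in zip(series[1:], series[:-1]))
    den = sum(x * x for x in series[:-1])
    return num / den

def penalized_risk(series, a, penalty_scale=1.0):
    """Empirical one-step squared forecast error plus a sqrt(1/n)-style
    complexity term, mimicking how a non-asymptotic bound trades fit
    against complexity when selecting among forecasting models."""
    n = len(series) - 1
    risk = sum((y - a * x) ** 2 for y, x in zip(series[1:], series[:-1])) / n
    return risk + penalty_scale * math.sqrt(1.0 / n)
```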
Memory shapes time perception and intertemporal choices
There is a consensus that human and non-human subjects experience temporal
distortions in many stages of their perceptual and decision-making systems.
Similarly, intertemporal choice research has shown that decision-makers
undervalue future outcomes relative to immediate ones. Here we combine
techniques from information theory and artificial intelligence to show how both
temporal distortions and intertemporal choice preferences can be explained as a
consequence of the coding efficiency of sensorimotor representation. In
particular, the model implies that interactions that constrain future behavior
are perceived as being both longer in duration and more valuable. Furthermore,
using simulations of artificial agents, we investigate how memory constraints
enforce a renormalization of the perceived timescales. Our results show that
qualitatively different discount functions, such as exponential and hyperbolic
discounting, arise as a consequence of an agent's probabilistic model of the
world.
Comment: 24 pages, 4 figures, 2 tables. Submitted
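The qualitative difference between discount functions mentioned above, that hyperbolic discounting produces preference reversals which exponential discounting cannot, is easy to check directly. The reward magnitudes and delays below are arbitrary illustrations.

```python
def exp_discount(delay, gamma=0.9):
    """Exponential discounting: value multiplier gamma**delay."""
    return gamma ** delay

def hyper_discount(delay, k=0.5):
    """Hyperbolic discounting: value multiplier 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)

def prefers_smaller_sooner(discount, small=5.0, large=10.0, gap=5, front=0):
    """True if the agent prefers `small` delivered at time `front` over
    `large` delivered `gap` steps later."""
    return small * discount(front) > large * discount(front + gap)
```

A hyperbolic discounter takes the smaller-sooner reward when it is imminent but switches to the larger-later one when both are pushed into the future; an exponential discounter never reverses.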