Learning-based Control of Unknown Linear Systems with Thompson Sampling
We propose a Thompson sampling-based learning algorithm for the Linear
Quadratic (LQ) control problem with unknown system parameters. The algorithm is
called Thompson sampling with dynamic episodes (TSDE), where two stopping
criteria determine the lengths of the dynamic episodes in Thompson sampling.
The first stopping criterion controls the growth rate of episode length. The
second stopping criterion is triggered when the determinant of the sample
covariance matrix is less than half of the previous value. We show under some
conditions on the prior distribution that the expected (Bayesian) regret of
TSDE accumulated up to time T is bounded by \tilde{O}(\sqrt{T}). Here
\tilde{O}(\cdot) hides constants and logarithmic factors. This is the first
\tilde{O}(\sqrt{T}) bound on
expected regret of learning in LQ control. By introducing a reinitialization
schedule, we also show that the algorithm is robust to time-varying drift in
model parameters. Numerical simulations are provided to illustrate the
performance of TSDE.
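The two stopping criteria are simple enough to sketch end to end. Below is a minimal, self-contained Python sketch for a scalar system, our illustration rather than the paper's implementation: the plant, the Gaussian posterior, the Riccati iteration, and the stabilizability guard are illustrative assumptions; only the two episode-switching tests follow the description above.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar plant x' = a*x + b*u + w; (a, b) are unknown to the learner.
a_true, b_true, Q, R, noise = 0.9, 0.5, 1.0, 1.0, 0.1

def lqr_gain(a, b, iters=200):
    # Certainty-equivalent scalar Riccati iteration for the sampled parameters.
    p = Q
    for _ in range(iters):
        p = Q + a * a * p - (a * b * p) ** 2 / (R + b * b * p)
    return -(a * b * p) / (R + b * b * p)

def sample_params(mu, cov):
    # Crude stabilizability guard for this sketch; the paper instead places
    # conditions on the prior distribution.
    while True:
        a, b = rng.multivariate_normal(mu, cov)
        if abs(a) < 1.0 or abs(b) > 0.05:
            return a, b

# Gaussian posterior over theta = (a, b), updated by Bayesian linear regression.
mu, cov = np.zeros(2), np.eye(2)
x, t, T = 0.0, 0, 2000
t_start, prev_len = 0, 0
det_start = np.linalg.det(cov)
theta = sample_params(mu, cov)                 # Thompson sample for episode 1
gain = lqr_gain(*theta)

while t < T:
    u = gain * x
    z = np.array([x, u])
    x = a_true * x + b_true * u + noise * rng.standard_normal()
    prec_old = np.linalg.inv(cov)
    cov = np.linalg.inv(prec_old + np.outer(z, z) / noise**2)
    mu = cov @ (prec_old @ mu + z * x / noise**2)
    t += 1
    # Criterion 1: an episode may exceed the previous episode's length by at most one.
    # Criterion 2: det of the sample covariance halved since the episode started.
    if (t - t_start) > prev_len + 1 or np.linalg.det(cov) < 0.5 * det_start:
        prev_len, t_start = t - t_start, t
        det_start = np.linalg.det(cov)
        theta = sample_params(mu, cov)         # resample at the episode boundary
        gain = lqr_gain(*theta)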
A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
covers the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges.
Comment: minor revision with a few clarifying passages and corrected typos
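The certainty-equivalence baseline at the heart of the survey's LQR case study fits in a few lines. The sketch below, whose toy double-integrator plant and noise levels are our assumptions rather than the survey's, estimates (A, B) by least squares from randomly excited rollouts and then solves the Riccati equation for the estimated model.

import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(1)

# Hypothetical discretized double integrator standing in for the unknown plant.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

# Roll out the system under random excitation and record (state, input) pairs.
X_next, Z = [], []
x = np.zeros(2)
for _ in range(500):
    u = rng.standard_normal(1)
    x_next = A @ x + B @ u + 0.01 * rng.standard_normal(2)
    Z.append(np.concatenate([x, u]))
    X_next.append(x_next)
    x = x_next

# Least-squares estimate of [A B] from the regression x_next ~ [x; u].
Z, X_next = np.asarray(Z), np.asarray(X_next)
AB_hat, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
A_hat, B_hat = AB_hat.T[:, :2], AB_hat.T[:, 2:]

# Certainty equivalence: plan as if the estimate were the truth.
P = solve_discrete_are(A_hat, B_hat, Q, R)
K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
print("estimated gain for u = -K x:", K)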
Extragradient method with variance reduction for stochastic variational inequalities
We propose an extragradient method with stepsizes bounded away from zero for
stochastic variational inequalities requiring only pseudo-monotonicity. We
provide convergence and complexity analyses, allowing for an unbounded feasible
set, an unbounded operator, and non-uniform variance of the oracle; moreover, we
do not require any regularization. Alongside the stochastic approximation
procedure, we iteratively reduce the variance of the stochastic error. Our
method attains the optimal oracle complexity O(\epsilon^{-2}) (up to
a logarithmic term) and a faster O(1/K) rate in terms of the mean
(quadratic) natural residual and the D-gap function, where K is the number of
iterations required for a given tolerance \epsilon. Such a convergence rate
represents an acceleration with respect to the stochastic error. The generated
sequence also enjoys a new feature: the sequence is bounded in L^p if the
stochastic error has a finite p-moment. Explicit estimates for the convergence
rate, the oracle complexity, and the p-moments are given, depending on problem
parameters and the distance of the initial iterate to the solution set. Moreover,
sharper constants are possible if the variance is uniform over the solution set
or the feasible set. Our results provide new classes of stochastic variational
inequalities for which a convergence rate of O(1/K) holds in terms
of the mean-squared distance to the solution set. Our analysis includes the
distributed solution of pseudo-monotone Cartesian variational inequalities
under partial coordination of parameters between users of a network.
Comment: 39 pages. To appear in SIAM Journal on Optimization (submitted July 2015, accepted December 2016). Uploaded to IMPA's preprint server at http://preprint.impa.br/visualizar?id=688
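To make the variance-reduction mechanism concrete, here is a minimal sketch of one possible instantiation: a plain extragradient step with a constant stepsize in which each oracle call is replaced by a mini-batch average whose size grows with the iteration count. The linear operator, the ball constraint, and the quadratic batch schedule are illustrative assumptions, not the paper's general setting.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative monotone linear operator F(z) = M z (rotation plus a small
# symmetric part), observed only through a noisy oracle.
M = np.array([[0.0, 1.0], [-1.0, 0.0]]) + 0.1 * np.eye(2)

def oracle(z):
    return M @ z + 0.5 * rng.standard_normal(2)

def project(z, radius=10.0):
    # Projection onto a ball, standing in for a general feasible set.
    n = np.linalg.norm(z)
    return z if n <= radius else z * (radius / n)

z = np.array([5.0, -3.0])
alpha = 0.2                          # stepsize bounded away from zero
for k in range(1, 100):
    N_k = k ** 2                     # growing batch size drives the variance down
    F1 = np.mean([oracle(z) for _ in range(N_k)], axis=0)
    z_half = project(z - alpha * F1)                 # extrapolation step
    F2 = np.mean([oracle(z_half) for _ in range(N_k)], axis=0)
    z = project(z - alpha * F2)                      # update step
print("approximate solution:", z)    # the unique solution here is z* = 0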
Estimation Considerations in Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We study a
consideration for the exploration vs. exploitation framework that does not
arise in multi-armed bandits but is crucial in contextual bandits: the way
exploration and exploitation are conducted in the present affects the bias and
variance of the potential outcome model estimation in subsequent stages of
learning. We develop parametric and non-parametric contextual bandits that
integrate balancing methods from the causal inference literature into their
estimation, making it less prone to estimation bias. We provide the
first regret bound analyses for contextual bandits with balancing in the domain
of linear contextual bandits that match the state-of-the-art regret bounds. We
demonstrate the strong practical advantage of balanced contextual bandits on a
large number of supervised learning datasets and on a synthetic example that
simulates model mis-specification and prejudice in the initial training data.
Additionally, we develop contextual bandits with simpler assignment policies by
leveraging sparse model estimation methods from the econometrics literature and
demonstrate empirically that in the early stages they can improve the rate of
learning and decrease regret.
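A minimal sketch of the balancing idea in the linear case: record the assignment probability of the chosen arm at decision time, and weight that observation by its inverse in the arm's ridge regression, so that contexts an arm was unlikely to receive are not under-represented in its model. The epsilon-greedy assignment rule and the synthetic linear rewards are our illustrative assumptions; the paper's balancing estimators are more elaborate.

import numpy as np

rng = np.random.default_rng(3)
d, n_arms, T, eps = 5, 3, 2000, 0.1
theta_true = rng.standard_normal((n_arms, d))    # hidden reward parameters

# Per-arm inverse-propensity-weighted ridge statistics: A = I + X'WX, b = X'Wy.
A = np.stack([np.eye(d)] * n_arms)
b = np.zeros((n_arms, d))

for t in range(T):
    x = rng.standard_normal(d)
    theta_hat = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    greedy = int(np.argmax(theta_hat @ x))
    # Epsilon-greedy assignment; the propensities are known at decision time.
    probs = np.full(n_arms, eps / n_arms)
    probs[greedy] += 1 - eps
    arm = rng.choice(n_arms, p=probs)
    reward = theta_true[arm] @ x + 0.1 * rng.standard_normal()
    # Balancing: reweight the observation by the inverse assignment probability.
    w = 1.0 / probs[arm]
    A[arm] += w * np.outer(x, x)
    b[arm] += w * reward * x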
Spectral approximation properties of isogeometric analysis with variable continuity
We study the spectral approximation properties of isogeometric analysis with
local continuity reduction of the basis. Such continuity reduction decreases the
interconnection between the degrees of freedom of the mesh, which allows for
large savings in computational cost during the solution of the resulting linear
system. The continuity reduction also introduces extra degrees of freedom that
modify the approximation properties of the
method. The convergence rate of such refined isogeometric analysis is
equivalent to that of the maximum continuity basis. We show how the breaks in
continuity and inhomogeneity of the basis lead to artefacts in the frequency
spectra, such as stopping bands and outliers, and present a unified description
of these effects in finite element method, isogeometric analysis, and refined
isogeometric analysis. Accuracy of the refined isogeometric analysis
approximations can be improved by using non-standard quadrature rules. In
particular, optimal quadrature rules lead to large reductions in the eigenvalue
errors and yield two extra orders of convergence, similar to classical
isogeometric analysis.
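The kind of spectral comparison described above can be reproduced in miniature with classical linear finite elements on the 1D Laplace eigenproblem, where the discrete generalized eigenvalues of K u = \lambda M u are compared against the exact values (\pi k)^2. The sketch below only illustrates the eigenvalue-error behaviour across the spectrum; the variable-continuity isogeometric bases and optimal quadrature rules of the paper are beyond it.

import numpy as np
from scipy.linalg import eigh

# -u'' = lambda * u on (0, 1) with u(0) = u(1) = 0, discretized by linear
# finite elements: stiffness K and consistent mass M on a uniform mesh.
n = 100                                # interior nodes
h = 1.0 / (n + 1)
K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
M = (4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)) * h / 6.0

lam = eigh(K, M, eigvals_only=True)    # discrete spectrum, ascending
exact = (np.pi * np.arange(1, n + 1)) ** 2
rel_err = lam / exact - 1.0
# The relative eigenvalue error grows toward the high-frequency end of the
# spectrum; this is the branch where stopping bands and outliers appear for
# bases with reduced continuity.
print(rel_err[[0, n // 2, n - 1]])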
Optimal Reinforcement Learning for Gaussian Systems
The exploration-exploitation trade-off is among the central challenges of
reinforcement learning. The optimal Bayesian solution is intractable in
general. This paper studies to what extent analytic statements about optimal
learning are possible if all beliefs are Gaussian processes. A first order
approximation of learning of both loss and dynamics, for nonlinear,
time-varying systems in continuous time and space, subject to a relatively weak
restriction on the dynamics, is described by an infinite-dimensional partial
differential equation. An approximate finite-dimensional projection gives an
impression of how this result may be helpful.
Comment: final pre-conference version of this NIPS 2011 paper. Once again,
please note some nontrivial changes to the exposition and interpretation of the
results, in particular in Equation (9) and Eqs. 11-14. The algorithm and
results have remained the same, but their theoretical interpretation has
changed.
Horde of Bandits using Gaussian Markov Random Fields
The gang of bandits (GOB) model (Cesa-Bianchi et al., 2013) is a recent contextual
bandits framework that shares information between a set of bandit problems,
related by a known (possibly noisy) graph. This model is useful in problems
like recommender systems where the large number of users makes it vital to
transfer information between users. Despite its effectiveness, the existing GOB
model can only be applied to small problems due to its quadratic
time-dependence on the number of nodes. Existing solutions to combat the
scalability issue require an often-unrealistic clustering assumption. By
exploiting a connection to Gaussian Markov random fields (GMRFs), we show that
the GOB model can be made to scale to much larger graphs without additional
assumptions. In addition, we propose a Thompson sampling algorithm which uses
the recent GMRF sampling-by-perturbation technique, allowing it to scale to
even larger problems (leading to a "horde" of bandits). We give regret bounds
and experimental results for GOB with Thompson sampling and epoch-greedy
algorithms, indicating that these methods are as good as or significantly
better than ignoring the graph or adopting a clustering-based approach.
Finally, when an existing graph is not available, we propose a heuristic for
learning it on the fly and show promising results.
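The sampling-by-perturbation step admits a short sketch. For a Gaussian posterior with sparse precision A = Lambda + Phi^T Phi / sigma^2 and linear term b = Phi^T y / sigma^2, an exact sample is theta = A^{-1}(b + c) with c ~ N(0, A), and c can be assembled term by term without factorizing A, because a graph-Laplacian prior factors through the incidence matrix. The chain graph, ridge term, and toy observations below are our illustrative assumptions.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(4)

# Chain graph over n users; the graph Laplacian factors as E.T @ E with E the
# edge-node incidence matrix.
n = 200
rows = np.arange(n - 1)
E = (sp.csr_matrix((np.ones(n - 1), (rows, rows)), shape=(n - 1, n))
     - sp.csr_matrix((np.ones(n - 1), (rows, rows + 1)), shape=(n - 1, n)))
ridge = 0.1
Lambda = ridge * sp.eye(n) + E.T @ E            # sparse prior precision

# Toy observations: noisy scalar rewards for a subset of users.
m, sigma = 50, 0.5
obs = rng.choice(n, size=m, replace=False)
Phi = sp.csr_matrix((np.ones(m), (np.arange(m), obs)), shape=(m, n))
y = rng.standard_normal(m)

A = Lambda + (Phi.T @ Phi) / sigma**2           # posterior precision (sparse)
b = (Phi.T @ y) / sigma**2

# Perturbation sample c ~ N(0, A), assembled from the three precision terms;
# one sparse solve then yields an exact posterior (Thompson) sample.
c = (np.sqrt(ridge) * rng.standard_normal(n)
     + E.T @ rng.standard_normal(n - 1)
     + (Phi.T @ rng.standard_normal(m)) / sigma)
theta = spsolve(sp.csc_matrix(A), b + c)        # Thompson sample over all users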
Exploration versus exploitation in reinforcement learning: a stochastic control approach
We consider reinforcement learning (RL) in continuous time and study the
problem of achieving the best trade-off between exploration of a black box
environment and exploitation of current knowledge. We propose an
entropy-regularized reward function involving the differential entropy of the
distributions of actions, and motivate and devise an exploratory formulation
for the state dynamics that captures repetitive learning under exploration.
The resulting optimization problem is a revitalization of the classical relaxed
stochastic control. We carry out a complete analysis of the problem in the
linear--quadratic (LQ) setting and deduce that the optimal feedback control
distribution for balancing exploitation and exploration is Gaussian. This in
turn interprets and justifies the widely adopted Gaussian exploration in RL,
beyond its simplicity for sampling. Moreover, the exploitation and exploration
are captured, respectively and mutually exclusively, by the mean and variance of
the Gaussian distribution. We also find that a more random environment contains
more learning opportunities in the sense that less exploration is needed. We
characterize the cost of exploration, which, for the LQ case, is shown to be
proportional to the entropy regularization weight and inversely proportional to
the discount rate. Finally, as the weight of exploration decays to zero, we
prove the convergence of the solution of the entropy-regularized LQ problem to
that of the classical LQ problem.
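Why a Gaussian emerges can be seen in one standard step (scalar notation ours, not the paper's): maximizing an expected reward plus a differential-entropy bonus over action densities yields a Gibbs distribution, which is Gaussian whenever the reward is quadratic in the action:

\max_{\pi} \int f(u)\,\pi(u)\,du + \lambda\,\mathcal{H}(\pi)
    \;\Longrightarrow\; \pi^*(u) \propto \exp\big(f(u)/\lambda\big),

and for f(u) = -a u^2 + b u with a > 0,

\pi^*(u) = \mathcal{N}\Big(\frac{b}{2a},\; \frac{\lambda}{2a}\Big).

The mean b/(2a) does not depend on \lambda (pure exploitation), while the variance \lambda/(2a) scales with the exploration weight, consistent with the mean/variance separation and the exploration cost described above.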
Posterior Sampling for Large Scale Reinforcement Learning
We propose a practical non-episodic PSRL algorithm that, unlike recent
state-of-the-art PSRL algorithms, uses a deterministic, model-independent
episode-switching schedule. Our algorithm, termed deterministic schedule PSRL
(DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove
a Bayesian regret bound under mild assumptions. Our result is more generally
applicable to multiple parameters and continuous state action problems. We
compare our algorithm with state-of-the-art PSRL algorithms on standard
discrete and continuous problems from the literature. Finally, we show how the
assumptions of our algorithm are satisfied by a sensible parametrization for a
large class of problems in sequential recommendations.
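A self-contained sketch of the deterministic-schedule idea on a toy two-state MDP: one posterior sample and one planning step per episode, with episode k lasting k steps. The MDP, the Dirichlet posterior over transitions, and the linear schedule are illustrative assumptions, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(5)

# Toy 2-state, 2-action MDP; transitions are unknown, rewards are known.
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.7, 0.3], [0.05, 0.95]]])   # P_true[s, a] = next-state dist
R = np.array([[0.0, 0.1], [1.0, 0.5]])            # R[s, a]

def greedy_policy(P, gamma=0.95, iters=500):
    # Value iteration on the sampled model; returns the greedy policy.
    V = np.zeros(2)
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return (R + gamma * (P @ V)).argmax(axis=1)

counts = np.ones((2, 2, 2))     # Dirichlet(1, 1) posterior via transition counts
s, t, T, k = 0, 0, 5000, 0
while t < T:
    k += 1
    # One posterior sample per episode; replanning happens only at episode
    # boundaries, on the deterministic schedule "episode k lasts k steps".
    P_sample = np.array([[rng.dirichlet(counts[si, ai]) for ai in range(2)]
                         for si in range(2)])
    policy = greedy_policy(P_sample)
    for _ in range(k):
        a = policy[s]
        s_next = rng.choice(2, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next
        t += 1
        if t >= T:
            break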
Probabilistic Programming with Gaussian Process Memoization
Gaussian Processes (GPs) are widely used tools in statistics, machine
learning, robotics, computer vision, and scientific computation. However,
despite their popularity, they can be difficult to apply; all but the simplest
classification or regression applications require specification and inference
over complex covariance functions that do not admit simple analytical
posteriors. This paper shows how to embed Gaussian processes in any
higher-order probabilistic programming language, using an idiom based on
memoization, and demonstrates its utility by implementing and extending classic
and state-of-the-art GP applications. The interface to Gaussian processes,
called gpmem, takes an arbitrary real-valued computational process as input and
returns a statistical emulator that automatically improves as the original
process is invoked and its input-output behavior is recorded. The flexibility
of gpmem is illustrated via three applications: (i) robust GP regression with
hierarchical hyper-parameter learning, (ii) discovering symbolic expressions
from time-series data by fully Bayesian structure learning over kernels
generated by a stochastic grammar, and (iii) a bandit formulation of Bayesian
optimization with automatic inference and action selection. All applications
share a single 50-line Python library and require fewer than 20 lines of
probabilistic code each.
Comment: 36 pages, 9 figures
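The memoization idiom itself is easy to sketch outside a probabilistic programming language. The scikit-learn version below, our illustration rather than the paper's gpmem, wraps a real-valued process and refits a GP emulator every time the process is invoked and its input-output behavior is recorded.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class GPMemo:
    # Wrap a real-valued process f; every invocation records (x, f(x)) and
    # refits the GP, so the emulator improves as the process is exercised.
    def __init__(self, f):
        self.f = f
        self.X, self.y = [], []
        self.gp = GaussianProcessRegressor(
            kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4))

    def __call__(self, x):
        y = self.f(x)
        self.X.append([x])
        self.y.append(y)
        self.gp.fit(np.array(self.X), np.array(self.y))
        return y

    def emulate(self, x):
        # Query the statistical emulator instead of the underlying process.
        mean, std = self.gp.predict(np.array([[x]]), return_std=True)
        return mean[0], std[0]

g = GPMemo(lambda x: np.sin(3 * x) + 0.1 * np.random.randn())
for x in np.linspace(0.0, 2.0, 15):
    g(x)                         # each call enriches the emulator
print(g.emulate(1.0))            # posterior mean and uncertainty at x = 1.0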