Distributed Stochastic Approximation for Solving Network Optimization Problems Under Random Quantization
We study distributed optimization problems over a network when the
communication between the nodes is constrained, and so information that is
exchanged between the nodes must be quantized. This imperfect communication
poses a fundamental challenge and, if not properly accounted for, prevents the
convergence of these algorithms. Our first
contribution in this paper is to propose a modified consensus-based gradient
method for solving such problems using random (dithered) quantization. This
algorithm can be interpreted as a distributed variant of a well-known
two-time-scale stochastic algorithm. We then study the convergence and derive
upper bounds on the rates of convergence of the proposed method as a function
of the bandwidths available between the nodes and the underlying network
topology, for both convex and strongly convex objective functions. Our results
complement the existing literature, where such convergence guarantees and explicit
formulas for the convergence rates are missing. Finally, we provide numerical simulations
to compare the convergence properties of the distributed gradient methods with
and without quantization for solving well-known regression problems over
networks, for both quadratic and absolute loss functions.
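
To make the type of update concrete, here is a minimal numpy sketch of a consensus-based gradient iteration in which each node only transmits a dithered (unbiased, randomly rounded) quantization of its state. The ring network, quadratic local objectives, step sizes, and quantizer resolution are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: each of n nodes holds a local quadratic f_i(x) = 0.5*(x - b_i)^2,
# so the network-wide minimizer is the average of the b_i.
n = 5
b = rng.normal(size=n)

# Doubly stochastic mixing matrix for a ring graph (equal weights, illustrative).
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 / 3.0

def dithered_quantize(x, delta):
    """Unbiased random quantization: E[Q(x)] = x for resolution delta."""
    return delta * np.floor(x / delta + rng.uniform(size=np.shape(x)))

x = np.zeros(n)          # local iterates
delta = 0.1              # quantizer resolution (proxy for the available bandwidth)
for k in range(1, 2001):
    alpha = 1.0 / k                          # diminishing step size
    q = dithered_quantize(x, delta)          # what the nodes actually exchange
    grad = x - b                             # local gradients
    x = W @ q - alpha * grad                 # consensus on quantized states + gradient step

print("iterates:", x, " target:", b.mean())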
Fast Convergence Rates of Distributed Subgradient Methods with Adaptive Quantization
We study distributed optimization problems over a network when the
communication between the nodes is constrained, and so information that is
exchanged between the nodes must be quantized. Recent advances using the
distributed gradient algorithm with a quantization scheme at a fixed resolution
have established convergence, but at rates significantly slower than when the
communications are unquantized.
In this paper, we introduce a novel quantization method, which we refer to as
adaptive quantization, that allows us to match the convergence rates under
perfect communications. Our approach adjusts the quantization scheme used by
each node as the algorithm progresses: as we approach the solution, we become
more certain about where the state variables are localized, and adapt the
quantizer codebook accordingly.
We bound the convergence rates of the proposed method as a function of the
communication bandwidth, the underlying network topology, and structural
properties of the constituent objective functions. In particular, we show that
if the objective functions are convex or strongly convex, then using adaptive
quantization does not affect the rate of convergence of the distributed
subgradient methods relative to the unquantized setting, except for a
constant that depends on the resolution of the quantizer. To the best of our
knowledge, the rates achieved in this paper are better than any existing work
in the literature for distributed gradient methods under finite communication
bandwidths. We also provide numerical simulations that compare convergence
properties of the distributed gradient methods with and without quantization
for solving distributed regression problems for both quadratic and absolute
loss functions.
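
A minimal sketch of the adaptive idea, under illustrative assumptions (a fixed bit budget per transmission, a geometrically shrinking codebook range, and simple quadratic local objectives); the adaptive_quantize routine below and its contraction schedule are placeholders, not the paper's exact codebook-adaptation rule.

import numpy as np

def adaptive_quantize(x, center, radius, bits):
    """Uniform b-bit quantizer on [center - radius, center + radius].

    As the algorithm converges, the caller shrinks `radius`, so the same number
    of bits yields an increasingly fine resolution around the current estimate.
    """
    levels = 2 ** bits
    x_clipped = np.clip(x, center - radius, center + radius)
    step = 2 * radius / (levels - 1)
    idx = np.round((x_clipped - (center - radius)) / step)   # integer index that would be transmitted
    return (center - radius) + idx * step                    # decoded value

rng = np.random.default_rng(1)
n, bits = 4, 3
b = rng.normal(size=n)                    # local quadratics f_i(x) = 0.5*(x - b_i)^2
W = np.full((n, n), 1.0 / n)              # complete-graph averaging (illustrative)

x = np.zeros(n)
centers, radius = np.zeros(n), 4.0
for k in range(1, 501):
    q = adaptive_quantize(x, centers, radius, bits)
    x = W @ q - (1.0 / k) * (x - b)                  # consensus + subgradient step
    centers, radius = q, max(0.99 * radius, 0.05)    # shrink the codebook range as we converge

print("estimates:", x, " target:", b.mean())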
Finite-Time Performance of Distributed Two-Time-Scale Stochastic Approximation
Two-time-scale stochastic approximation is a popular iterative method for
finding the solution of a system of two equations. Such methods have found
broad applications in many areas, especially in machine learning and
reinforcement learning. In this paper, we propose a distributed variant of this
method over a network of agents, where the agents use two graphs representing
their communication at different speeds due to the nature of their
two-time-scale updates. Our main contribution is to provide a finite-time
analysis for the performance of the proposed method. In particular, we
establish an upper bound on the rate at which the mean square errors at the
agents converge to zero, as a function of the step sizes and the network topology.
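
As a rough illustration of the setup, the following numpy sketch runs a distributed two-time-scale iteration on an invented pair of coupled linear equations: the fast variables are mixed over one (denser) graph and updated with larger step sizes, the slow variables over another (ring) graph with smaller step sizes. All numerical choices are placeholders, not the paper's model.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative coupled system solved over a network of n agents:
#   slow equation:  x + y - mean(b) = 0,    fast equation:  y - x - mean(c) = 0,
# whose solution is x* = (mean(b) - mean(c)) / 2 and y* = x* + mean(c).
n = 6
b, c = rng.normal(1.0, 0.5, n), rng.normal(-1.0, 0.5, n)

def ring_mixing(n, self_weight):
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = self_weight
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = (1 - self_weight) / 2
    return W

W_fast = np.full((n, n), 1.0 / n)      # denser graph used for the fast updates
W_slow = ring_mixing(n, 0.5)           # sparser ring graph used for the slow updates

x, y = np.zeros(n), np.zeros(n)
for k in range(1, 20001):
    alpha, beta = 1.0 / k, 1.0 / k ** (2 / 3)               # slow vs. fast step sizes
    noise_x, noise_y = rng.normal(0, 0.1, n), rng.normal(0, 0.1, n)
    y = W_fast @ y - beta * (y - x - c + noise_y)           # fast iterate: tracks y*(x)
    x = W_slow @ x - alpha * (x + y - b + noise_x)          # slow iterate: drives x to x*

x_star = (b.mean() - c.mean()) / 2
print("x estimate:", x.mean(), " x*:", x_star)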
A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning
We study a novel two-time-scale stochastic gradient method for solving
optimization problems where the gradient samples are generated from a
time-varying Markov random process parameterized by the underlying optimization
variable. These time-varying samples make the stochastic gradient biased and
dependent, which can potentially lead to the divergence of the iterates. To
address this issue, we consider a two-time-scale update scheme, where one scale
is used to estimate the true gradient from the Markovian samples and the other
scale is used to update the decision variable with the estimated gradient.
While these two iterates are implemented simultaneously, the former is updated
"faster" (using bigger step sizes) than the latter (using smaller step sizes).
Our first contribution is to characterize the finite-time complexity of the
proposed two-time-scale stochastic gradient method. In particular, we provide
explicit formulas for the convergence rates of this method under different
structural assumptions on the objective function, namely, strong convexity,
convexity, non-convexity under the PL condition, and general non-convexity.
Our second contribution is to apply our framework to study the performance of
the popular actor-critic methods in solving stochastic control and
reinforcement learning problems. First, we study an online natural actor-critic
algorithm for the linear-quadratic regulator and establish an explicit
convergence rate; this is the first time such a result is
known in the literature. Second, we look at the standard online actor-critic
algorithm over finite state and action spaces and derive a convergence rate
that recovers the best known rate derived
specifically for this problem. Finally, we support our theoretical analysis
with numerical simulations where the convergence rate is visualized.
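
A minimal sketch of the two-time-scale scheme on a toy quadratic objective whose gradient samples come from a two-state Markov chain (the chain, objective, and step-size schedules are illustrative, not the paper's): the fast iterate averages the Markovian gradient samples, and the slow iterate moves the decision variable using that estimate.

import numpy as np

rng = np.random.default_rng(3)

# Illustrative objective: f(theta) = 0.5 * E_pi[(theta - s)^2], where s is produced by a
# two-state Markov chain, so grad f(theta) = theta - E_pi[s].
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # transition matrix; stationary distribution is (2/3, 1/3)
values = np.array([0.0, 3.0])       # E_pi[s] = 1.0, so theta* = 1.0

state = 0
theta, g = 5.0, 0.0                 # slow decision variable and fast gradient estimate
for k in range(1, 50001):
    alpha = 1.0 / k                 # slow time scale (decision variable)
    beta = 1.0 / k ** (2 / 3)       # fast time scale (gradient estimation)

    state = rng.choice(2, p=P[state])          # next Markovian sample along one trajectory
    sample_grad = theta - values[state]        # biased, dependent gradient sample

    g = g + beta * (sample_grad - g)           # fast update: average out the Markov noise
    theta = theta - alpha * g                  # slow update: move the decision variable

print("theta:", theta, " theta*:", 1.0)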
Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation
We study the policy evaluation problem in multi-agent reinforcement learning,
modeled by a Markov decision process. In this problem, the agents operate in a
common environment under a fixed control policy, working together to discover
the value (global discounted cumulative reward) associated with each
environmental state. Over a series of time steps, the agents act, get rewarded,
update their local estimate of the value function, then communicate with their
neighbors. The local update at each agent can be interpreted as a distributed
variant of the popular temporal difference learning method {\sf TD}$(\lambda)$.
Our main contribution is to provide a finite-time analysis on the performance of
this distributed {\sf TD}$(\lambda)$ algorithm for both constant and
time-varying step sizes. The key idea in our analysis is to use the geometric
mixing time $\tau$ of the underlying Markov chain, that is, although the
"noise" in our algorithm is Markovian, its dependence is very weak between samples
spaced out at every $\tau$ steps. We provide an explicit upper bound on the
convergence rate of the proposed method as a function of the network topology,
the discount factor, the constant $\lambda$, and the mixing time $\tau$.
Our results also provide a mathematical explanation for observations that
have appeared previously in the literature about the choice of $\lambda$. Our
upper bound illustrates the trade-off between approximation accuracy and
convergence speed implicit in the choice of $\lambda$. When $\lambda$ is close to one, the
solution will correspond to the best possible approximation of the value
function, while choosing a small $\lambda$ leads to faster convergence when the
noise in the algorithm has large variance.
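
A minimal sketch of the kind of local update described above, on an invented three-state chain with linear function approximation: each agent forms its own TD(lambda) correction from its local reward and mixes its weight vector with its neighbors'. The environment, features, rewards, and mixing weights are placeholders, not the paper's setting.

import numpy as np

rng = np.random.default_rng(4)

# Illustrative multi-agent policy-evaluation setup: a 3-state Markov chain under a fixed
# policy, 2-d features, and n agents whose local rewards average to the global reward.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.4, 0.5]])
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])                 # feature vector phi(s) = Phi[s]
gamma, lam, alpha = 0.9, 0.5, 0.05
n = 4
local_rewards = rng.normal(1.0, 0.3, size=(n, 3))   # r_i(s); the global reward is the column mean

W = np.full((n, n), 1.0 / n)                 # consensus weights (complete graph)
theta = np.zeros((n, 2))                     # one linear value estimate per agent
z = np.zeros((n, 2))                         # eligibility traces
s = 0
for k in range(20000):
    s_next = rng.choice(3, p=P[s])
    phi, phi_next = Phi[s], Phi[s_next]
    z = gamma * lam * z + phi                              # eligibility-trace update
    delta = (local_rewards[:, s]                           # local TD errors
             + gamma * theta @ phi_next - theta @ phi)
    theta = W @ theta + alpha * delta[:, None] * z         # consensus + local TD(lambda) step
    s = s_next

print("agent estimates of the value weights:\n", theta)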
Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning
We study the policy evaluation problem in multi-agent reinforcement learning.
In this problem, a group of agents works cooperatively to evaluate the value
function for the global discounted cumulative reward problem, which is
composed of local rewards observed by the agents. Over a series of time steps,
the agents act, get rewarded, update their local estimate of the value
function, then communicate with their neighbors. The local update at each agent
can be interpreted as a distributed consensus-based variant of the popular
temporal difference learning algorithm TD(0).
While distributed reinforcement learning algorithms have been presented in
the literature, almost nothing is known about their convergence rate. Our main
contribution is providing a finite-time analysis for the convergence of the
distributed TD(0) algorithm. We do this when the communication network between
the agents is time-varying in general. We obtain an explicit upper bound on the
rate of convergence of this algorithm as a function of the network topology and
the discount factor. Our results mirror what we would expect from using
distributed stochastic gradient descent for solving convex optimization
problems.
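
For concreteness, here is a small numpy sketch of a consensus-based TD(0) update in which the communication graph changes randomly from step to step; the chain, features, rewards, and mixing matrices are illustrative placeholders rather than the paper's model.

import numpy as np

rng = np.random.default_rng(5)

# At every step each agent mixes its weight vector with neighbors under whichever graph
# is currently active, then applies a local TD(0) correction from its own reward.
P = np.array([[0.6, 0.4], [0.3, 0.7]])       # 2-state chain under the fixed policy
Phi = np.array([[1.0], [0.5]])               # scalar feature per state
gamma, alpha, n = 0.9, 0.05, 3
local_rewards = np.array([[1.0, 0.0],        # r_i(s) for each agent i
                          [0.0, 1.0],
                          [0.5, 0.5]])

# Two possible communication graphs; one is drawn at random each step (time-varying network).
W_full = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
W_pair = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])

theta, s = np.zeros((n, 1)), 0
for k in range(20000):
    s_next = rng.choice(2, p=P[s])
    W = W_full if rng.random() < 0.5 else W_pair          # time-varying topology
    delta = local_rewards[:, s] + gamma * (theta @ Phi[s_next]) - (theta @ Phi[s])
    theta = W @ theta + alpha * delta[:, None] * Phi[s]   # consensus + local TD(0) step
    s = s_next

print("agents' value-function weights:", theta.ravel())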
A Reinforcement Learning Framework for Sequencing Multi-Robot Behaviors
Given a list of behaviors and associated parameterized controllers for
solving different individual tasks, we study the problem of selecting an
optimal sequence of coordinated behaviors in multi-robot systems for completing
a given mission that cannot be handled by any single behavior. In
addition, we are interested in the case where part of the information about the
underlying mission is unknown; therefore, the robots must cooperatively learn
this information through their course of actions. Such a problem can be
formulated as an optimal decision problem in multi-robot systems; however, it
is in general intractable due to modeling imperfections and the curse of
dimensionality of the decision variables. To circumvent these issues, we first
consider an alternate formulation of the original problem by introducing a
sequence of switching times between behaviors. Our main contribution is then to
propose a novel reinforcement learning based method, that combines Q-learning
and online gradient descent, for solving this reformulated problem. In
particular, the optimal sequence of the robots' behaviors is found by using
Q-learning while the optimal parameters of the associated controllers are
obtained through an online gradient descent method. Finally, to illustrate the
effectiveness of our proposed method, we implement it on a team of
differential-drive robots for solving two different missions, namely, convoy
protection and object manipulation.
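
A toy sketch of how the two pieces can be combined, under invented reward models and a hypothetical run_behavior routine: Q-learning selects which behavior to run at each stage of the mission, while online gradient steps tune each behavior's controller parameter.

import numpy as np

rng = np.random.default_rng(6)

# Invented mission model: n_stages stages, n_behaviors candidate behaviors per stage,
# each behavior parameterized by a scalar controller gain w.
n_stages, n_behaviors = 3, 2
w_star = np.array([[0.5, -1.0],      # (unknown) best controller parameter per stage/behavior
                   [1.0,  0.3],
                   [-0.5, 0.8]])
base = np.array([[1.0, 0.2],         # (unknown) baseline reward per stage/behavior
                 [0.1, 1.0],
                 [0.8, 0.9]])

def run_behavior(stage, b, w, noise=0.05):
    """Noisy mission reward from running behavior b with parameter w at this stage."""
    return base[stage, b] - (w - w_star[stage, b]) ** 2 + noise * rng.normal()

Q = np.zeros((n_stages, n_behaviors))          # value of picking behavior b at a stage
w = np.zeros((n_stages, n_behaviors))          # controller parameters, tuned online
alpha_q, alpha_w, eps = 0.1, 0.05, 0.2

for episode in range(5000):
    for stage in range(n_stages):
        b = rng.integers(n_behaviors) if rng.random() < eps else int(np.argmax(Q[stage]))
        r = run_behavior(stage, b, w[stage, b])
        # Online gradient ascent on the reward w.r.t. the controller parameter
        # (here the gradient is known in closed form; in practice it would be estimated).
        grad = -2.0 * (w[stage, b] - w_star[stage, b])
        w[stage, b] += alpha_w * grad
        # Q-learning update over the sequence of behaviors.
        target = r + (np.max(Q[stage + 1]) if stage + 1 < n_stages else 0.0)
        Q[stage, b] += alpha_q * (target - Q[stage, b])

print("greedy behavior sequence:", np.argmax(Q, axis=1))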
A Decentralized Policy Gradient Approach to Multi-task Reinforcement Learning
We develop a mathematical framework for solving multi-task reinforcement
learning (MTRL) problems based on a type of policy gradient method. The goal in
MTRL is to learn a common policy that operates effectively in different
environments; these environments have similar (or overlapping) state spaces,
but have different rewards and dynamics. We highlight two fundamental
challenges in MTRL that are not present in its single task counterpart, and
illustrate them with simple examples. We then develop a decentralized
entropy-regularized policy gradient method for solving the MTRL problem, and
study its finite-time convergence rate. We demonstrate the effectiveness of the
proposed method using a series of numerical experiments. These experiments
range from small-scale "GridWorld" problems that readily demonstrate the
trade-offs involved in multi-task learning to large-scale problems, where
common policies are learned to navigate an airborne drone in multiple
(simulated) environments.
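
The following numpy sketch illustrates the flavor of a decentralized entropy-regularized policy-gradient step on invented one-state (bandit-style) tasks: each agent takes an exact entropy-regularized gradient step on its own task and then averages its policy parameters with the other agents. This is a simplification for illustration, not the full method.

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def entropy_reg_pg(theta, r, tau):
    """Exact entropy-regularized policy gradient for a one-state softmax policy.

    J(theta) = sum_a pi(a) * (r[a] - tau * log pi(a)); the gradient below follows from
    the softmax derivative d pi_a / d theta_b = pi_a (1[a=b] - pi_b).
    """
    pi = softmax(theta)
    adv = (r - tau * np.log(pi)) - pi @ (r - tau * np.log(pi))
    return pi * adv

# Three "tasks" (illustrative bandit rewards over 4 common actions); the agents seek one
# shared policy that trades off performance across all of them.
rewards = np.array([[1.0, 0.2, 0.0, 0.4],
                    [0.1, 1.0, 0.3, 0.2],
                    [0.2, 0.1, 0.2, 1.0]])
n, n_actions, tau, alpha = 3, 4, 0.1, 0.5
W = np.full((n, n), 1.0 / n)                  # consensus weights (complete graph)

theta = np.zeros((n, n_actions))              # one local copy of the policy per agent
for k in range(2000):
    grads = np.stack([entropy_reg_pg(theta[i], rewards[i], tau) for i in range(n)])
    theta = W @ theta + alpha * grads         # consensus + local policy-gradient step

print("common policy:", softmax(theta.mean(axis=0)))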
Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm
Actor-critic style two-time-scale algorithms are very popular in
reinforcement learning, and have seen great empirical success. However, their
performance is not completely understood theoretically. In this paper, we
characterize the global convergence of an online natural actor-critic algorithm
in the tabular setting using a single trajectory. Our analysis applies to very
general settings, as we only assume that the underlying Markov chain is ergodic
under all policies (the so-called Recurrence assumption). We employ
$\epsilon$-greedy sampling in order to ensure enough exploration.
For a fixed exploration parameter $\epsilon$, we show that the natural
actor-critic algorithm is provably close to
the global optimum after a finite number of iterations, with an explicit bound on the gap.
By carefully diminishing the exploration parameter $\epsilon$ as the
iterations proceed, we also show convergence to the global optimum at an explicit rate.
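
A toy single-trajectory sketch on an invented two-state MDP, illustrating the two-time-scale structure: a fast tabular critic tracks Q-values along the trajectory sampled with epsilon-greedy exploration, while a slow actor applies the natural-gradient update for softmax policies (adding estimated Q-values to the logits of the visited state). The MDP, step sizes, and critic update are placeholders.

import numpy as np

rng = np.random.default_rng(7)

P = np.array([[[0.8, 0.2], [0.2, 0.8]],     # P[s, a] = next-state distribution (invented)
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                   # R[s, a] (invented)
              [0.0, 2.0]])
gamma, eps = 0.9, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros((2, 2))                    # policy logits
Q = np.zeros((2, 2))                        # critic estimate
s = 0
for k in range(1, 50001):
    beta = 1.0 / k ** (3 / 5)               # fast (critic) step size
    eta = 1.0 / k                           # slow (actor) step size

    pi = softmax(theta[s])
    behavior = (1 - eps) * pi + eps / 2     # epsilon-greedy exploration
    a = rng.choice(2, p=behavior)
    s_next = rng.choice(2, p=P[s, a])

    # Critic: fast TD-style update of Q along the single trajectory.
    v_next = softmax(theta[s_next]) @ Q[s_next]
    Q[s, a] += beta * (R[s, a] + gamma * v_next - Q[s, a])

    # Actor: slow natural policy-gradient step (logits move toward the estimated Q-values).
    theta[s] += eta * Q[s]

    s = s_next

print("greedy policy:", np.argmax(theta, axis=1), " estimated Q:\n", Q)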
Convergence Rates of Accelerated Markov Gradient Descent with Applications in Reinforcement Learning
Motivated by broad applications in machine learning, we study the popular
accelerated stochastic gradient descent (ASGD) algorithm for solving (possibly
nonconvex) optimization problems. We characterize the finite-time performance
of this method when the gradients are sampled from Markov processes, and hence
biased and dependent from time step to time step; in contrast, the analysis in
existing work relies heavily on the stochastic gradients being independent and
sometimes unbiased. Our main contributions show that under certain (standard)
assumptions on the underlying Markov chain generating the gradients, ASGD
converges at nearly the same rate with Markovian gradient samples as with
independent gradient samples. The only difference is a logarithmic factor that
accounts for the mixing time of the Markov chain.
One of the key motivations for this study is the class of complicated control problems
that can be modeled by a Markov decision process and solved using reinforcement
learning. We apply the accelerated method to several challenging problems in
the OpenAI Gym and Mujoco, and show that acceleration can significantly improve
the performance of the classic REINFORCE algorithm.
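
A minimal sketch of Nesterov-style accelerated SGD with gradients sampled along a single Markov trajectory; the chain, objective, momentum, and step size are illustrative placeholders, not the experiments in the paper.

import numpy as np

rng = np.random.default_rng(8)

# Objective: f(theta) = 0.5 * E_pi[(theta - s)^2], with s drawn from a two-state Markov chain,
# so the gradient samples are biased and dependent from step to step.
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])                  # slowly mixing chain; stationary distribution (2/3, 1/3)
values = np.array([0.0, 3.0])                 # E_pi[s] = 1.0, so theta* = 1.0

theta, theta_prev, state = 5.0, 5.0, 0
alpha, momentum = 0.01, 0.9
for k in range(50000):
    state = rng.choice(2, p=P[state])                   # gradient samples come from one trajectory
    y = theta + momentum * (theta - theta_prev)         # look-ahead (Nesterov) point
    grad = y - values[state]                            # stochastic gradient at the look-ahead point
    theta_prev, theta = theta, y - alpha * grad         # accelerated update

print("theta:", theta, " theta*:", 1.0)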