13 research outputs found
Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes
We consider Markov decision processes (MDPs) with multiple limit-average (or
mean-payoff) objectives. There exist two different views: (i) the expectation
semantics, where the goal is to optimize the expected mean-payoff objective,
and (ii) the satisfaction semantics, where the goal is to maximize the
probability of runs such that the mean-payoff value stays above a given vector.
We consider optimization with respect to both objectives at once, thus unifying
the existing semantics. Precisely, the goal is to optimize the expectation
while ensuring the satisfaction constraint. Our problem captures the notion of
optimization with respect to strategies that are risk-averse (i.e., ensure
certain probabilistic guarantees). Our main results are as follows: First, we
present algorithms for the decision problems which are always polynomial in the
size of the MDP. We also show that an approximation of the Pareto-curve can be
computed in time polynomial in the size of the MDP, and the approximation
factor, but exponential in the number of dimensions. Second, we present a
complete characterization of the strategy complexity (in terms of memory bounds
and randomization) required to solve our problem.
Comment: Extended journal version of the LICS'15 paper
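The expectation side of the problem above, maximizing the expected mean payoff of an MDP, admits a linear-programming formulation over state-action occupation measures. The sketch below is the generic textbook LP, not the paper's combined expectation-plus-satisfaction algorithm, and the two-state MDP and its rewards are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions each.
# P[s, a, s'] = transition probability, r[s, a] = mean-payoff reward.
P = np.zeros((2, 2, 2))
r = np.zeros((2, 2))
# state 0: action 0 stays (reward 1), action 1 moves to state 1 (reward 0)
P[0, 0, 0] = 1.0; r[0, 0] = 1.0
P[0, 1, 1] = 1.0; r[0, 1] = 0.0
# state 1: action 0 stays (reward 2), action 1 moves to state 0 (reward 0)
P[1, 0, 1] = 1.0; r[1, 0] = 2.0
P[1, 1, 0] = 1.0; r[1, 1] = 0.0

S, A = r.shape
n = S * A  # one variable x[s,a] per state-action pair (occupation measure)

# Flow conservation: sum_a x[s,a] = sum_{s',a'} P[s',a',s] x[s',a'] for each s,
# plus the normalization sum_{s,a} x[s,a] = 1.
A_eq = np.zeros((S + 1, n))
for s in range(S):
    for sp in range(S):
        for a in range(A):
            A_eq[s, sp * A + a] -= P[sp, a, s]
    for a in range(A):
        A_eq[s, s * A + a] += 1.0
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

# Maximize expected mean payoff = r . x (linprog minimizes, so negate).
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
gain = -res.fun  # optimal expected mean payoff (here: stay in state 1)
```

For this toy instance the optimal occupation measure concentrates on the self-loop in state 1, so the optimal gain is 2.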
Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning
The stochastic approximation algorithm is a widely used probabilistic method
for finding a zero of a vector-valued function, when only noisy measurements of
the function are available. In the literature to date, one can make a
distinction between "synchronous" updating, whereby every component of the
current guess is updated at each time, and "asynchronous" updating, whereby
only one component is updated. In principle, it is also possible to update, at
each time instant, some but not all components of the current guess, which might be
termed "batch asynchronous stochastic approximation" (BASA). One can
also make a distinction between using a "local" clock versus a "global" clock.
In this paper, we propose a unified formulation of batch asynchronous
stochastic approximation (BASA) algorithms, and develop a general methodology
for proving that such algorithms converge, irrespective of whether global or
local clocks are used. These convergence proofs make use of weaker hypotheses
than existing results. For example: existing convergence proofs when a local
clock is used require that the measurement noise is an i.i.d. sequence. Here, it
is assumed that the measurement errors form a martingale difference sequence.
Also, all results to date assume that the stochastic step sizes satisfy a
probabilistic analog of the Robbins-Monro conditions. We replace this by a
purely deterministic condition on the irreducibility of the underlying Markov
processes.
As specific applications to Reinforcement Learning, we introduce "batch"
versions of the temporal difference algorithm for value iteration, and of
the Q-learning algorithm for finding the optimal action-value function, and
also permit the use of local clocks instead of a global clock. In all cases, we
establish the convergence of these algorithms, under milder conditions than in
the existing literature.
Comment: 27 pages
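The "batch asynchronous" idea can be illustrated in tabular form: at each step a random subset of state-action pairs is updated, and each pair keeps its own visit counter (a "local clock"). The sketch below is a generic construction under these assumptions, not the paper's algorithm; the deterministic two-state MDP, the discount factor, and the constant step size are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic toy MDP: next_state[s][a], reward[s][a].
next_state = np.array([[0, 1], [1, 0]])
reward = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.5

S, A = reward.shape
Q = np.zeros((S, A))
visits = np.zeros((S, A), dtype=int)  # per-pair "local clocks"
pairs = [(s, a) for s in range(S) for a in range(A)]

for _ in range(3000):
    # Batch update: a random subset of state-action pairs, not just one.
    batch = rng.choice(len(pairs), size=2, replace=False)
    for idx in batch:
        s, a = pairs[idx]
        visits[s, a] += 1
        sp = next_state[s, a]
        target = reward[s, a] + gamma * Q[sp].max()
        # A constant step size suffices because this toy MDP is
        # deterministic; real BASA uses decaying, clock-driven steps.
        Q[s, a] += 0.5 * (target - Q[s, a])

# Fixed point for this toy MDP: Q*(0,0)=0.5, Q*(0,1)=1, Q*(1,0)=2, Q*(1,1)=0.5
```

Each pair is updated on its own schedule, yet all entries converge to the same fixed point of the Bellman optimality operator.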
Queue-Aware Dynamic Clustering and Power Allocation for Network MIMO Systems via Distributive Stochastic Learning
In this paper, we propose a two-timescale delay-optimal dynamic clustering
and power allocation design for downlink network MIMO systems. The dynamic
clustering control is adaptive to the global queue state information (GQSI)
only and computed at the base station controller (BSC) over a longer time
scale. On the other hand, the power allocations of all the BSs in one cluster
are adaptive to both intra-cluster channel state information (CCSI) and
intra-cluster queue state information (CQSI), and computed at the cluster
manager (CM) over a shorter time scale. We show that the two-timescale
delay-optimal control can be formulated as an infinite-horizon average cost
Constrained Partially Observed Markov Decision Process (CPOMDP). By exploiting
the special problem structure, we shall derive an equivalent Bellman equation
in terms of Pattern Selection Q-factor to solve the CPOMDP. To address the
distributive requirement and the issue of exponential memory requirement and
computational complexity, we approximate the Pattern Selection Q-factor by the
sum of Per-cluster Potential functions and propose a novel distributive online
learning algorithm to estimate the Per-cluster Potential functions (at each CM)
as well as the Lagrange multipliers (LM) (at each BS). We show that the
proposed distributive online learning algorithm converges almost surely (with
probability 1). By exploiting the birth-death structure of the queue dynamics,
we further decompose the Per-cluster Potential function into sum of Per-cluster
Per-user Potential functions and formulate the instantaneous power allocation
as a Per-stage QSI-aware Interference Game played among all the CMs. We also
propose a QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA) and
show that it can achieve the Nash Equilibrium (NE).
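The water-filling step inside an iterative water-filling algorithm is the classical single-user power allocation. The routine below is the generic textbook version with a bisection search for the water level, under an illustrative power budget and channel gains; it is not the paper's QSI-aware variant:

```python
import numpy as np

def waterfill(gains, budget, tol=1e-10):
    """Classic water-filling: p_i = max(0, mu - 1/g_i) with sum p_i = budget.

    The water level mu is found by bisection, since the allocated total
    power is monotonically increasing in mu.
    """
    inv = 1.0 / np.asarray(gains, dtype=float)  # noise-to-gain floors
    lo, hi = 0.0, inv.max() + budget            # bracket for the water level
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(0.0, mu - inv).sum() > budget:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.0, 0.5 * (lo + hi) - inv)

# Example: two subchannels with gains 2.0 and 1.0, unit power budget.
p = waterfill([2.0, 1.0], 1.0)  # stronger channel gets more power
```

In an iterative (simultaneous) water-filling game, each player repeatedly runs this step treating the others' latest power levels as fixed interference, which is the structure of the per-stage game played among the CMs.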
IST Austria Technical Report
We consider Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) objectives.
There have been two different views: (i) the expectation semantics, where the goal is to optimize the expected mean-payoff objective, and (ii) the satisfaction semantics, where the goal is to maximize the probability of runs such that the mean-payoff value stays above a given vector.
We consider the problem where the goal is to optimize the expectation under the constraint that the satisfaction semantics is ensured, and thus consider a generalization that unifies the existing semantics. Our problem captures the notion of optimization with respect to strategies that are risk-averse (i.e., ensure certain probabilistic guarantees).
Our main results are algorithms for the decision problem which are always polynomial in the size of the MDP.
We also show that an approximation of the Pareto-curve can be computed in time polynomial in the size of the MDP, and the approximation factor, but exponential in the number of dimensions. Finally, we present a complete characterization of the strategy complexity (in terms of memory bounds and randomization) required to solve our problem.
Age of Semantics in Cooperative Communications: To Expedite Simulation Towards Real via Offline Reinforcement Learning
The age of information metric fails to correctly describe the intrinsic
semantics of a status update. In an intelligent reflecting surface-aided
cooperative relay communication system, we propose the age of semantics (AoS)
for measuring semantics freshness of the status updates. Specifically, we focus
on the status updating from a source node (SN) to the destination, which is
formulated as a Markov decision process (MDP). The objective of the SN is to
maximize the expected satisfaction of AoS and energy consumption under the
maximum transmit power constraint. To seek the optimal control policy, we first
derive an online deep actor-critic (DAC) learning scheme under the on-policy
temporal difference learning framework. However, implementing the online DAC in
practice poses the key challenge in infinitely repeated interactions between
the SN and the system, which can be dangerous particularly during the
exploration. We then put forward a novel offline DAC scheme, which estimates
the optimal control policy from a previously collected dataset without any
further interactions with the system. Numerical experiments verify the
theoretical results and show that our offline DAC scheme significantly
outperforms the online DAC scheme and the most representative baselines in
terms of mean utility, demonstrating strong robustness to dataset quality.
Comment: This work has been submitted to the IEEE for possible publication
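The on-policy actor-critic template underlying an online DAC scheme can be sketched in tabular form: a critic tracks a value baseline via a TD error, and an actor follows the policy gradient scaled by that error. The sketch below is a generic one-step actor-critic on a made-up single-state, two-action problem, not the paper's deep AoS formulation; all rewards and step sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-state problem: two actions with fixed rewards.
rewards = np.array([1.0, 0.0])
theta = np.zeros(2)  # actor: softmax preferences over actions
V = 0.0              # critic: value baseline for the single state
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)            # sample an action on-policy
    r = rewards[a]
    delta = r - V                      # TD error (no successor state here)
    V += alpha_critic * delta          # critic update
    grad = -pi                         # gradient of log pi(a) w.r.t. theta...
    grad[a] += 1.0                     # ...is one_hot(a) - pi
    theta += alpha_actor * delta * grad  # actor update

pi = softmax(theta)  # policy now strongly prefers the rewarding action
```

The same two coupled updates, with the tables replaced by neural networks and the samples drawn from a fixed dataset rather than fresh interactions, is the step from this online template toward an offline scheme.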