Average optimality for continuous-time Markov decision processes under weak continuity conditions
This article considers the average optimality for a continuous-time Markov
decision process with Borel state and action spaces and an arbitrarily
unbounded nonnegative cost rate. The existence of a deterministic stationary
optimal policy is proved under a set of conditions different from and more
general than those in the previous literature: the controlled process can be explosive,
the transition rates can be arbitrarily unbounded and are weakly continuous,
the multifunction defining the admissible action spaces can be neither
compact-valued nor upper semi-continuous, and the cost rate is not necessarily
inf-compact.
On Reward Structures of Markov Decision Processes
A Markov decision process can be parameterized by a transition kernel and a
reward function. Both play essential roles in the study of reinforcement
learning as evidenced by their presence in the Bellman equations. In our
inquiry into various kinds of "costs" associated with reinforcement learning,
inspired by the demands of robotic applications, rewards are central to
understanding the structure of a Markov decision process, and reward-centric
notions can elucidate important concepts in reinforcement learning.
Specifically, we study the sample complexity of policy evaluation and develop
a novel estimator with an instance-specific error bound for estimating a
single state value. Under
the online regret minimization setting, we refine the transition-based MDP
constant, diameter, into a reward-based constant, maximum expected hitting
cost, and with it, provide a theoretical explanation for how a well-known
technique, potential-based reward shaping, could accelerate learning with
expert knowledge. In an attempt to study safe reinforcement learning, we model
hazardous environments with irrecoverability and propose a quantitative notion
of safe learning via reset efficiency. In this setting, we modify a classic
algorithm to account for resets, achieving promising preliminary numerical
results. Lastly, for MDPs with multiple reward functions, we develop a planning
algorithm that computationally efficiently finds Pareto-optimal stochastic
policies.
Comment: This PhD thesis draws heavily from arXiv:1907.02114 and arXiv:2002.06299; minor edits.
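The abstract above credits potential-based reward shaping with accelerating learning. A minimal sketch of that technique follows; the toy chain MDP and the potential function are illustrative assumptions, not taken from the thesis.

```python
# Potential-based reward shaping (Ng, Harada, Russell): add
# F(s, s') = gamma * phi(s') - phi(s) to the base reward. Because F
# telescopes along trajectories, the optimal policy is preserved.

GAMMA = 0.9

def shaped_reward(r, s, s_next, phi, gamma=GAMMA):
    """Return the base reward r augmented with the shaping term."""
    return r + gamma * phi[s_next] - phi[s]

# Toy 4-state chain: move right until the goal state 3, base reward 1
# only on reaching the goal. The potential phi grows toward the goal,
# so shaping gives informative intermediate rewards.
phi = [0.0, 0.3, 0.6, 1.0]                       # assumed potential
transitions = [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 1.0)]  # (s, s', base r)

for s, s_next, r in transitions:
    print(s, "->", s_next, round(shaped_reward(r, s, s_next, phi), 3))
```

The shaping term rewards progress toward the goal without changing which policy is optimal, which is the mechanism the thesis analyzes via its maximum-expected-hitting-cost constant.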
A DYNAMIC MODEL FOR DETERMINING OPTIMAL RANGE IMPROVEMENT PROGRAMS
A Markov chain dynamic programming model is presented for determining optimal range improvement strategies as well as accompanying livestock production practices. The model specification focuses on the improved representation of rangeland dynamics and livestock response under alternative range conditions. The model is applied to range management decision making in the Cross Timbers Region of central Oklahoma. Results indicate that tebuthiuron treatments are economically feasible over the range of treatment costs evaluated. Optimal utilization of forage production following a treatment requires the conjunctive employment of prescribed burning and variable stocking rates over the treatment's life.
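A Markov chain dynamic program of the kind described above can be sketched with value iteration over range-condition states. All numbers below (transition probabilities, net returns, discount factor) are illustrative assumptions, not figures from the study.

```python
# Hypothetical Markov chain DP for range management: states are range
# conditions, actions are management practices, and the objective is the
# expected discounted net return per acre. Numbers are made up.
import numpy as np

states = ["poor", "fair", "good"]
actions = ["no_action", "burn", "treat"]   # e.g., prescribed burn, tebuthiuron
# P[a][s, s']: condition transition probabilities under each practice
P = {
    "no_action": np.array([[0.9, 0.1, 0.0],
                           [0.3, 0.6, 0.1],
                           [0.1, 0.3, 0.6]]),
    "burn":      np.array([[0.6, 0.3, 0.1],
                           [0.1, 0.6, 0.3],
                           [0.0, 0.2, 0.8]]),
    "treat":     np.array([[0.2, 0.5, 0.3],
                           [0.0, 0.4, 0.6],
                           [0.0, 0.1, 0.9]]),
}
# r[a][s]: annual net livestock return minus practice cost (illustrative)
r = {"no_action": np.array([2.0, 5.0, 9.0]),
     "burn":      np.array([1.0, 4.5, 8.5]),
     "treat":     np.array([-6.0, -2.0, 4.0])}

beta = 0.95                                 # annual discount factor
V = np.zeros(len(states))
for _ in range(500):                        # value iteration to convergence
    Q = np.array([r[a] + beta * P[a] @ V for a in actions])
    V = Q.max(axis=0)
policy = [actions[i] for i in Q.argmax(axis=0)]
print(policy)
```

The optimal action per condition state is read off the converged Q-values; the study's richer model adds stocking-rate decisions and forage dynamics on top of this basic recursion.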
Network formation by reinforcement learning: the long and medium run
We investigate a simple stochastic model of social network formation by the
process of reinforcement learning with discounting of the past. In the limit,
for any value of the discounting parameter, small, stable cliques are formed.
However, the time it takes to reach the limiting state in which cliques have
formed is very sensitive to the discounting parameter. Depending on this value,
the limiting result may or may not be a good predictor for realistic
observation times.
Comment: 14 pages.
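A network-formation process of the kind described above can be simulated directly. The update rule below (multiplicative discounting of past weights, unit reinforcement of each realized interaction) is an assumption in the spirit of the model, not the paper's exact specification.

```python
# Illustrative simulation of network formation by reinforcement learning
# with discounting of the past: each agent visits a partner with
# probability proportional to accumulated interaction weights.
import random

random.seed(0)
N, ROUNDS, DISCOUNT = 6, 2000, 0.01

# w[i][j]: agent i's propensity to visit agent j (no self-visits)
w = [[1.0 if i != j else 0.0 for j in range(N)] for i in range(N)]

for _ in range(ROUNDS):
    for i in range(N):
        # discount the past, then choose a partner proportional to weight
        w[i] = [(1 - DISCOUNT) * x for x in w[i]]
        j = random.choices(range(N), weights=w[i])[0]
        w[i][j] += 1.0            # reinforce the realized interaction
        w[j][i] += 1.0            # ... in both directions

# Each agent's favorite partner after many rounds
top = [max(range(N), key=lambda j: w[i][j]) for i in range(N)]
print(top)
```

Running the simulation for different values of DISCOUNT illustrates the abstract's point: the limiting clique structure is robust, but the time to reach it is highly sensitive to the discounting parameter.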
Age-Energy Tradeoff in Fading Channels with Packet-Based Transmissions
The optimal transmission strategy to minimize the weighted combination of age
of information (AoI) and total energy consumption is studied in this paper. It
is assumed that the status update information is obtained and transmitted at
fixed rate over a Rayleigh fading channel in a packet-based wireless
communication system. A maximum number of transmission rounds per packet is enforced
to guarantee a certain reliability of the update packets. Given a fixed average
transmission power, the age-energy tradeoff can be formulated as a constrained
Markov decision process (CMDP) problem considering the sensing power
consumption as well. Employing Lagrangian relaxation, the CMDP problem is
transformed into a Markov decision process (MDP) problem. An algorithm is
proposed to obtain the optimal power allocation policy. Through simulation
results, it is shown that both age and energy efficiency can be improved by the
proposed optimal policy compared with two benchmark schemes. Also, age can be
effectively reduced at the expense of higher energy cost, and more emphasis on
energy consumption leads to higher average age at the same energy efficiency.
Overall, the tradeoff between average age and energy efficiency is identified.
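The Lagrangian-relaxation step used above can be sketched on a generic tabular instance: fold the constrained cost into the reward as r - lam*c and solve the resulting unconstrained MDP by value iteration. The tiny random instance below is an illustrative assumption, not the paper's AoI/energy model.

```python
# Lagrangian relaxation of a constrained MDP: for a fixed multiplier lam,
# the relaxed problem max E[sum gamma^t (r - lam*c)] is an ordinary MDP.
import numpy as np

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition kernel
r = rng.random((nS, nA))                        # reward (e.g., negative age)
c = rng.random((nS, nA))                        # constrained cost (e.g., energy)

def solve_mdp(reward):
    """Value iteration for the unconstrained MDP with the given reward."""
    V = np.zeros(nS)
    for _ in range(1000):
        Q = reward + gamma * P @ V              # shape (nS, nA)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

for lam in (0.0, 0.5, 2.0):                     # sweep the multiplier
    policy, V = solve_mdp(r - lam * c)
    print(lam, policy)
```

In a full CMDP solution the multiplier is not swept by hand but adjusted (e.g., by subgradient ascent or bisection) until the expected cost meets the power budget, which is the role the Lagrangian plays in the paper's algorithm.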