
    State-Augmentation Transformations for Risk-Sensitive Markov Decision Processes

    Markov decision processes (MDPs) provide a mathematical framework for modeling sequential decision making (SDM), in which system evolution and reward are partly under the control of a decision maker and partly random. MDPs have been widely adopted in fields such as finance, robotics, manufacturing, and control systems, and they serve as the underlying models for dynamic programming and reinforcement learning (RL) algorithms in stochastic control. In this thesis, we study risk estimation in MDPs, where the variability of random rewards is taken into account. First, we categorize rewards into four classes: deterministic or stochastic, and state- or transition-based. Although numerous theoretical methods are designed for MDPs or Markov processes with a deterministic (and state-based) reward, many practical problems are naturally modeled by processes with stochastic (and transition-based) rewards. When the optimality criterion is the risk-neutral expectation of a (discounted) total reward, a model (reward) simplification can bridge this gap. When the criterion is risk-sensitive, however, a model simplification changes the risk value. To preserve risk, we observe that most, if not all, inherent risk measures depend on the reward sequence (Rt). To bridge the gap between theoretical methods and practical problems under risk-sensitive criteria, we propose a state-augmentation transformation (SAT). We study four cases in detail, each requiring a different form of the SAT for risk preservation. In numerical experiments, we compare the results of the model simplifications with those of the SAT and illustrate that (i) the model simplifications change (Rt) as well as the return (total reward) distribution, and (ii) the proposed SAT transforms processes with complicated rewards, such as stochastic and transition-based rewards, into processes with deterministic, state-based rewards while leaving (Rt) intact.

    Second, we consider constrained risk-sensitive SDM problems in dynamic environments. Unlike previous studies, we consider three factors simultaneously: constraints, risk, and a dynamic environment. We propose a scheme that generates a synthetic dataset for training an approximator. The reasons for not using historical data are twofold. The first is information incompleteness: historical data usually contains no information on criterion parameters (which risk objective and which constraints are of concern) or on the optimal policy (usually only an action per data item), and in many cases even the environmental parameters (such as the costs involved) are incomplete. The second concerns optimality: decision makers may prefer an easy-to-use policy (such as an EOQ policy) to an optimal one, and since practical problems can differ from the theoretical model in diverse and subtle ways, it is hard to determine whether the preferred policy is optimal. We therefore propose to evaluate or estimate risk measures with RL methods and to train an approximator, such as a neural network, on a synthetic dataset. A numerical experiment validates the proposed scheme.

    The contributions of this study are threefold. First, for risk evaluation in different cases, we propose the SAT theorem and its corollaries, which enable theoretical methods to solve practical problems with (Rt) preserved. Second, we estimate three risk measures involving return variance as examples, to illustrate the difference between the results of the SAT and of the model simplification. Third, we present a scheme for constrained, risk-sensitive SDM problems in a dynamic environment, with an inventory-control example.
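
    As a concrete illustration of the SAT idea, the following minimal Python sketch carries the just-realized reward inside the augmented state, so the augmented process has a deterministic, state-based reward while the sampled reward sequence (Rt) is unchanged. All names and numbers are illustrative, not from the thesis, and this covers only the simplest of the four cases the thesis distinguishes.

    import random

    # Toy MDP with a stochastic, transition-based reward (made-up numbers).
    P = {                     # P[s][a] -> list of (next state, probability)
        0: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
        1: {0: [(0, 1.0)],           1: [(0, 0.3), (1, 0.7)]},
    }
    R = {                     # R[(s, a, s')] -> list of (reward, probability)
        (0, 0, 0): [(1.0, 0.5), (2.0, 0.5)],
        (0, 0, 1): [(0.0, 1.0)],
        (0, 1, 1): [(1.0, 1.0)],
        (1, 0, 0): [(3.0, 0.5), (0.0, 0.5)],
        (1, 1, 0): [(1.0, 1.0)],
        (1, 1, 1): [(2.0, 1.0)],
    }

    def draw(pairs, rng):
        # Sample a value from a list of (value, probability) pairs.
        x, acc = rng.random(), 0.0
        for value, prob in pairs:
            acc += prob
            if x < acc:
                return value
        return pairs[-1][0]

    def sat_step(aug, a, rng):
        # One step of the augmented chain. The augmented state (s, r)
        # carries the just-realized reward, so the augmented process has
        # a deterministic, state-based reward: reward((s, r)) = r.
        s, _ = aug
        s_next = draw(P[s][a], rng)
        r_next = draw(R[(s, a, s_next)], rng)
        return (s_next, r_next)

    def rollout(s0, policy, T, seed=0):
        # Reward sequence (Rt) read off the augmented states; it is equal
        # in law to the reward sequence of the original process.
        rng = random.Random(seed)
        aug, rewards = (s0, 0.0), []
        for _ in range(T):
            aug = sat_step(aug, policy(aug[0]), rng)
            rewards.append(aug[1])
        return rewards

    print(rollout(0, lambda s: 0, T=5))

    Repeating rollout over many seeds and taking the sample variance of the summed rewards yields a Monte Carlo return-variance estimate of the kind compared in the numerical experiments.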

    Spectrum sharing models in cognitive radio networks

    Spectrum scarcity demands new ways of managing the distribution of radio frequency bands so that they are used more effectively. The emerging technology that can enable this paradigm shift is the cognitive radio. Different models for organizing and managing cognitive radios have emerged, each with specific strategic purposes. In this article we review the spectrum allocation patterns of cognitive radio networks and analyse the common basis of each model. We expose the vulnerabilities and open challenges that still threaten the adoption and exploitation of cognitive radios for open civil networks.

    Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes

    Information-theoretic principles for learning and acting have been proposed to solve particular classes of Markov decision problems. Mathematically, such approaches are governed by a variational free-energy principle and allow solving MDP planning problems with information-processing constraints expressed in terms of a Kullback-Leibler divergence with respect to a reference distribution. Here we consider a generalization of such MDP planners by taking model uncertainty into account. As model uncertainty can also be formalized as an information-processing constraint, we can derive a unified solution from a single generalized variational principle. We provide a generalized value iteration scheme together with a convergence proof. As limit cases, this generalized scheme includes standard value iteration with a known model, Bayesian MDP planning, and robust planning. We demonstrate the benefits of this approach in a grid world simulation.
    Comment: 16 pages, 3 figures
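
    To make the limit-case structure concrete, here is a minimal NumPy sketch of value iteration with a KL information-processing constraint toward a reference policy, assuming a known model; the paper's generalized scheme additionally treats model uncertainty as a second KL term, which this sketch omits, and all names here are illustrative.

    import numpy as np

    def free_energy_value_iteration(P, r, prior, beta, gamma=0.95, tol=1e-8):
        # Soft value iteration with a KL constraint toward `prior`.
        # Shapes: P[a, s, s'], r[s, a], prior[s, a] (rows sum to 1).
        V = np.zeros(r.shape[0])
        while True:
            # Q(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s, a)}[V(s')]
            Q = r + gamma * np.einsum("ast,t->sa", P, V)
            # Free-energy backup:
            # V(s) = (1/beta) log sum_a prior(a|s) exp(beta * Q(s, a))
            z = np.log(prior) + beta * Q
            m = z.max(axis=1, keepdims=True)
            lse = (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True))).ravel()
            V_new = lse / beta
            if np.max(np.abs(V_new - V)) < tol:
                # Optimal constrained policy: pi(a|s) ~ prior(a|s) e^{beta Q}
                return V_new, np.exp(z - lse[:, None])
            V = V_new

    # Tiny two-state, two-action example (made-up numbers).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a=0, s, s']
                  [[0.1, 0.9], [0.8, 0.2]]])  # P[a=1, s, s']
    r = np.array([[1.0, 0.0], [0.0, 1.0]])    # r[s, a]
    prior = np.full((2, 2), 0.5)              # uniform reference policy
    V, pi = free_energy_value_iteration(P, r, prior, beta=5.0)

    As beta grows, the soft backup approaches the hard maximum of standard value iteration (one of the limit cases named above); as beta shrinks, the policy collapses onto the prior, which is exactly the trade-off the information-processing constraint encodes.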