14 research outputs found

    Model-Based Reinforcement Learning Exploiting State-Action Equivalence

    Leveraging an equivalence property in the state space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance; these sets are provably smaller than their equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and we define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of √(SA/C) in any communicating MDP with S states, A actions, and C classes, which corresponds to a massive improvement when C ≪ SA. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
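The core idea behind equivalence-aware confidence sets can be sketched as follows: if several state-action pairs are known to share the same transition law (up to a relabeling of next states, ignored here for simplicity), their observed transition counts can be pooled, and the resulting confidence radius shrinks. The function name and the specific Weissman-style L1 radius below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def l1_confidence_radius(n, n_next_states, delta=0.05):
    # L1 deviation bound for an empirical distribution over
    # n_next_states outcomes estimated from n samples.
    return np.sqrt(2 * n_next_states * np.log(2 / delta) / max(n, 1))

# Three equivalent state-action pairs, observed 50 times each,
# with 4 possible next states.
rng = np.random.default_rng(0)
true_p = np.array([0.4, 0.3, 0.2, 0.1])
counts = [np.bincount(rng.choice(4, size=50, p=true_p), minlength=4)
          for _ in range(3)]

pooled = np.sum(counts, axis=0)   # 150 samples instead of 50
p_hat = pooled / pooled.sum()     # shared empirical estimate for the class

# Pooling triples the sample count, shrinking the radius by sqrt(3).
print(l1_confidence_radius(50, 4), l1_confidence_radius(150, 4))
```

With C classes instead of SA individual pairs, each class collects roughly SA/C times more samples, which is the source of the regret improvement stated in the abstract.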

    Strategic Level Proton Therapy Patient Admission Planning: A Markov Decision Process Modeling Approach

    A relatively new consideration in proton therapy planning is the requirement that the mix of patients treated from different categories satisfy desired mix percentages. Deviations from these percentages and their impacts on operational capabilities are of particular interest to healthcare planners. In this study, we investigate intelligent ways of admitting patients to a proton therapy facility that maximize the total expected number of treatment sessions (fractions) delivered to patients in a planning period with stochastic patient arrivals, while penalizing deviation from the patient-mix restrictions. We propose a Markov Decision Process (MDP) model that provides useful insights for determining the best patient admission policies in the case of an unexpected opening in the facility (e.g., a no-show or an appointment cancellation). To overcome the curse of dimensionality for larger and more realistic instances, we propose an aggregate MDP model that approximates optimal patient admission policies using a weighted aggregation technique. Our models are applicable to healthcare treatment facilities throughout the United States, but are motivated by collaboration with the University of Florida Proton Therapy Institute (UFPTI).
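A minimal toy version of such an admission MDP can be solved by backward induction. All numbers below (arrival probabilities, fractions per category, target mix, penalty weight) are made-up illustrations, and the state is deliberately tiny; the paper's model is far richer.

```python
from functools import lru_cache

T, CAP = 10, 6             # decision periods, treatment slots
P_A = 0.6                  # arrival probability of category A (else B)
FRAC = {"A": 25, "B": 30}  # fractions delivered per admitted patient
TARGET_A, LAM = 0.5, 100   # desired share of A among admits, penalty weight

@lru_cache(maxsize=None)
def value(t, used, a_admitted):
    # Expected future reward from period t with `used` slots filled,
    # `a_admitted` of which are category-A patients.
    if t == T or used == CAP:
        share = a_admitted / used if used else TARGET_A
        return -LAM * abs(share - TARGET_A)  # terminal patient-mix penalty
    out = 0.0
    for cat, p in (("A", P_A), ("B", 1 - P_A)):
        reject = value(t + 1, used, a_admitted)
        admit = FRAC[cat] + value(t + 1, used + 1,
                                  a_admitted + (cat == "A"))
        out += p * max(admit, reject)        # admit only when worthwhile
    return out

print(round(value(0, 0, 0), 2))
```

The `max(admit, reject)` step is where the admission policy emerges: near the end of the horizon, the mix penalty can make rejecting an over-represented category optimal even though admitting always delivers fractions.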

    Extending Mission Duration of UAS Multicopters: Multi-disciplinary Approach

    Multicopters are important tools in industry, the military, and research, but they suffer from short flight times and mission durations. In this thesis, we discuss three ways to increase flight times and therefore the viability of using multicopters in a variety of missions. Alternate power sources such as hydrogen fuel cells and solar cells are starting to be used on multicopters; in our research, we simulate modern fuel cells, show how well they currently work as the power source for multicopters, and assess how close they are to becoming useful in Unmanned Aircraft System (UAS) technology. Increasing the efficiency with which the available energy is used can also increase mission duration. Two characteristics that affect the efficiency of a mission are the multicopter's flight speed and the payload it carries. These characteristics are well understood in larger rotorcraft but often ignored in smaller multicopters. In our research, we explore the effect of flight speed on the dynamics of a multicopter and show that higher speeds lead to longer flight times due to the effect of translational lift. Lastly, we developed an online-updating multi-flight planning algorithm for stop-and-charge missions, a method that can potentially extend a mission indefinitely. The multi-flight planning algorithm, the variable resolution horizon, reduces the computing resources required to 15% to 40% of those of a typical optimal planner, with at most a 5.6% decrease in expected future reward, a metric for accuracy. The results of this thesis help guide fuel-type decisions for multicopter missions, show examples of how to increase flight time by increasing efficiency, and develop a framework for multi-flight missions. Advisers: Justin Bradley and Carrick Detweiler
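One way a "variable resolution horizon" can cut planning cost is to keep fine time resolution for the near future and coarsen it further out, so far fewer decision points cover the same horizon. The doubling rule below is our own assumption for illustration, not necessarily the thesis's exact segmentation scheme.

```python
def variable_resolution_steps(horizon, base=1):
    # Cover `horizon` time units with step sizes that double as the
    # planner looks further ahead, clipping the last step to fit.
    steps, t, dt = [], 0, base
    while t < horizon:
        dt = min(dt, horizon - t)
        steps.append(dt)
        t += dt
        dt *= 2  # coarsen the resolution for more distant decisions
    return steps

steps = variable_resolution_steps(64)
print(steps, len(steps))  # 7 decision points instead of 64 uniform ones
```

Replanning after each flight (online updating) then refines the coarse far-future segments as they move into the fine-resolution near term, which is consistent with the accuracy loss staying small.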

    Upper Confidence Reinforcement Learning exploiting state-action equivalence

    Leveraging an equivalence property on the set of states or state-action pairs in a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known, in a discrete MDP with average-reward criterion and no reset. We study powerful similarities between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as ~O(D√(KCT)), where C is the number of classes of equivalent state-action pairs and K bounds the size of the support of the transitions. A non-trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggest this is due to the ad hoc criterion for stopping episodes in UCRL2. We replace it with hypothesis testing, which in turn considerably improves all strategies. It is then empirically validated that learning the structure can be beneficial in a full-blown RL problem.
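A crude sketch of clustering state-action pairs by their empirical transition distributions: sorting each distribution before comparing mimics "equivalence up to a relabeling of next states" (the optimal-transport flavor mentioned above), and a greedy L1-distance test groups the pairs. The paper's clustering procedure and its statistically sound threshold are more refined than this illustration.

```python
import numpy as np

def profile(counts):
    # Empirical distribution sorted in decreasing order, so two pairs
    # with relabeled but identical transition laws get the same profile.
    p = np.asarray(counts, dtype=float)
    return np.sort(p / p.sum())[::-1]

def greedy_clusters(count_vectors, tol):
    clusters = []  # list of (representative profile, member indices)
    for i, c in enumerate(count_vectors):
        prof = profile(c)
        for rep, members in clusters:
            if np.abs(rep - prof).sum() <= tol:  # L1 distance test
                members.append(i)
                break
        else:
            clusters.append((prof, [i]))
    return [m for _, m in clusters]

counts = [
    np.array([40, 30, 20, 10]),  # pair 0
    np.array([10, 21, 39, 30]),  # pair 1: same profile, relabeled states
    np.array([70, 10, 10, 10]),  # pair 2: genuinely different
]
print(greedy_clusters(counts, tol=0.1))  # → [[0, 1], [2]]
```

In practice the tolerance must shrink with the sample count (as in the confidence radii above), which is exactly where a sound procedure differs from this fixed-threshold sketch.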