Regret Bounds for Reinforcement Learning with Policy Advice
In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action space. Our empirical simulations support our
theoretical analysis. This suggests RLPA may offer significant advantages in
large domains where some prior good policies are provided.
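To make the setting concrete, here is a minimal, hedged Python sketch of choosing among a set of input policies with an optimistic (UCB-style) index. It illustrates the general policy-advice idea rather than the paper's exact RLPA procedure, and the `run_policy_for_phase` environment hook is an assumption of this sketch.

```python
import math

def select_best_policy(policies, run_policy_for_phase, total_steps, phase_len=100):
    """UCB-style selection over a set of input policies.

    A simplified sketch of the general policy-advice idea (not the paper's
    exact RLPA procedure): each input policy is treated as an arm, run for a
    phase, and scored by its average per-step reward plus an optimistic bonus.
    `run_policy_for_phase(policy, n_steps)` is an assumed environment hook
    returning the total reward collected by `policy` over `n_steps`.
    """
    n = len(policies)
    counts = [0] * n          # phases each policy has been run
    means = [0.0] * n         # running average per-step reward
    steps_used = 0
    phase = 0
    while steps_used < total_steps:
        phase += 1

        def ucb(i):
            # Untried policies are selected first; otherwise use a UCB score.
            if counts[i] == 0:
                return float("inf")
            return means[i] + math.sqrt(2.0 * math.log(phase) / counts[i])

        i = max(range(n), key=ucb)
        reward = run_policy_for_phase(policies[i], phase_len)
        counts[i] += 1
        means[i] += (reward / phase_len - means[i]) / counts[i]
        steps_used += phase_len
    return max(range(n), key=lambda i: means[i])
```

Note that the bookkeeping depends only on the number of input policies and the number of phases, which mirrors the abstract's claim that regret and computation are independent of the size of the state and action space.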
Experimental analysis of sample-based maps for long-term SLAM
This paper presents a system for long-term SLAM (simultaneous localization and mapping) by mobile service robots and its experimental evaluation in a real dynamic environment. To deal with the stability-plasticity dilemma (the trade-off between adaptation to new patterns and preservation of old patterns), the environment is represented at multiple timescales simultaneously (5 in our experiments). A sample-based representation is
proposed, where older memories fade at different rates depending on the timescale, and robust statistics are used to interpret the samples. The dynamics of this representation are analysed in a five-week experiment, measuring the relative influence of short- and long-term memories over time, and further demonstrating the robustness of the approach.
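As an illustration of the sample-based, multi-timescale idea, the following hedged Python sketch keeps a bounded buffer of occupancy samples per timescale and uses a median as a simple robust statistic. The timescale periods, buffer size, and choice of the median are illustrative assumptions, not the paper's exact parameters.

```python
from collections import deque
import statistics

class MultiTimescaleCell:
    """One map cell observed at several timescales simultaneously.

    A hedged sketch of the general idea: each timescale keeps a bounded
    buffer of recent occupancy samples, subsampled so that longer
    timescales forget more slowly.
    """
    def __init__(self, periods=(1, 5, 25, 125, 625), buffer_size=20):
        self.periods = periods                       # record every k-th update
        self.buffers = [deque(maxlen=buffer_size) for _ in periods]
        self.t = 0

    def update(self, occupied: float):
        """Record a new observation (e.g. 1.0 = occupied, 0.0 = free)."""
        self.t += 1
        for period, buf in zip(self.periods, self.buffers):
            if self.t % period == 0:
                buf.append(occupied)

    def estimates(self):
        """Robust per-timescale occupancy estimates (median over samples)."""
        return [statistics.median(buf) if buf else None for buf in self.buffers]
```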
Probabilistic Inference for Fast Learning in Control
We provide a novel framework for very fast model-based reinforcement learning in continuous state and action spaces. The framework requires probabilistic models that explicitly characterize their levels of confidence. Within this framework, we use flexible, non-parametric models to describe the world based on previously collected experience. We demonstrate learning on the cart-pole problem in a setting where we provide very limited prior knowledge about the task. Learning progresses rapidly, and a good policy is found after only a handful of iterations.
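A hedged skeleton of such a model-based loop is sketched below. The hooks `env_rollout` and `improve_policy` are assumed placeholders, and the scikit-learn GP regressor stands in for whatever non-parametric probabilistic model the paper actually uses.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def model_based_rl_loop(env_rollout, improve_policy, policy, n_iterations=10):
    """Skeleton of a model-based RL loop built around a confidence-aware model.

    A hedged sketch of the overall framework (hooks are assumptions): a
    non-parametric probabilistic model (a GP regressor as a stand-in) is
    refit on all experience collected so far, and the policy is improved
    against that model rather than the real system.
    """
    X, Y = [], []                                  # (state, action) inputs -> next states
    for _ in range(n_iterations):
        inputs, targets = env_rollout(policy)      # one real trial on the system
        X.extend(inputs)
        Y.extend(targets)
        model = GaussianProcessRegressor().fit(np.asarray(X), np.asarray(Y))
        # model.predict(..., return_std=True) exposes predictive uncertainty,
        # which the policy-improvement step can use to penalise risky plans.
        policy = improve_policy(policy, model)
    return policy
```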
Fermionic Molecular Dynamics for nuclear dynamics and thermodynamics
A new Fermionic Molecular Dynamics (FMD) model based on a Skyrme functional
is proposed in this paper. After introducing the basic formalism, some first
applications to nuclear structure and nuclear thermodynamics are presented.
Comment: 5 pages, Proceedings of the French-Japanese Symposium, September 2008. To be published in Int. J. of Mod. Phys.
Actor-Critic Policy Learning in Cooperative Planning
In this paper, we introduce a method for learning and adapting cooperative control strategies in real-time stochastic domains. Our framework is an instance of the intelligent cooperative control architecture (iCCA). The agent starts by following the "safe" plan calculated by the planning module and incrementally adapts its policy to maximize the cumulative rewards. Actor-critic and the consensus-based bundle algorithm (CBBA) were employed as the building blocks of the iCCA framework. We demonstrate the performance of our approach by simulating limited-fuel unmanned aerial vehicles aiming for stochastic targets. In one experiment where the optimal solution can be calculated, the integrated framework boosted the optimality of the solution by an average of 10% when compared to running each of the modules individually, while keeping the computational load within the requirements for real-time implementation.
Sponsors: Boeing Scientific Research Laboratories; United States Air Force Office of Scientific Research (Grant FA9550-08-1-0086).
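For reference, a generic actor-critic update of the kind named as a building block might look like the hedged sketch below. It is not the iCCA integration itself, and the tabular softmax parameterisation is an illustrative assumption.

```python
import numpy as np

def actor_critic_update(theta, V, s, a, r, s_next,
                        alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One tabular-critic / softmax-actor update step.

    A generic actor-critic sketch (illustrative of the building block, not
    the paper's iCCA integration): the critic's TD error updates the value
    table and also scales the actor's policy-gradient step.
    `theta` is an (n_states, n_actions) preference table, `V` a value table.
    """
    delta = r + gamma * V[s_next] - V[s]          # TD error
    V[s] += alpha_critic * delta                  # critic update
    prefs = theta[s]
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    grad = -probs                                 # d log pi(a|s) / d preferences
    grad[a] += 1.0
    theta[s] += alpha_actor * delta * grad        # actor update
    return delta
```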
Using artificial intelligence techniques for strategy generation in the Commons game
In this paper, we consider the use of artificial intelligence techniques to aid in the discovery of winning strategies for the Commons Game (CG). The game represents a common scenario in which multiple parties share the use of a self-replenishing resource. The resource deteriorates quickly if used indiscriminately. If used responsibly, however, the resource thrives. We consider the scenario in which one player uses hill climbing or particle swarm optimization to select the course of action, while the remaining N − 1 players use a fixed probability vector. We show that hill climbing and particle swarm optimization consistently generate winning strategies.
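A hedged sketch of this kind of search is shown below: the strategy is a probability vector over actions, and hill climbing perturbs one component at a time. The `evaluate` function (the focal player's score against the fixed opponents) is an assumed hook, not the paper's Commons Game implementation.

```python
import random

def hill_climb_strategy(evaluate, n_actions=5, iters=500, step=0.05):
    """Hill climbing over a probability-vector strategy.

    A hedged sketch: perturb one component, renormalise, and keep the
    change only if the focal player's score improves. `evaluate(strategy)`
    is an assumed scoring hook against the fixed-probability opponents.
    """
    def normalise(v):
        s = sum(v)
        return [x / s for x in v]

    strategy = normalise([random.random() for _ in range(n_actions)])
    best_score = evaluate(strategy)
    for _ in range(iters):
        candidate = strategy[:]
        i = random.randrange(n_actions)
        candidate[i] = max(1e-6, candidate[i] + random.uniform(-step, step))
        candidate = normalise(candidate)
        score = evaluate(candidate)
        if score > best_score:
            strategy, best_score = candidate, score
    return strategy, best_score
```

Particle swarm optimization would search the same probability-vector space, but with a population of candidate strategies instead of a single incumbent.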
What works? Interventions to reduce readmission after hip fracture: A rapid review of systematic reviews
Background: Hip fracture is a common serious injury in older people, and reducing readmission after hip fracture is a priority in many healthcare systems. Interventions which significantly reduce readmission after hip fracture have been identified, and the aim of this review is to collate and summarise the efficacy of these interventions in one place. Methods: In a rapid review of systematic reviews, one reviewer (ELS) searched the Ovid SP version of Medline and the Cochrane Database of Systematic Reviews. Titles and abstracts of 915 articles were reviewed. Nineteen systematic reviews were included. The same reviewer (ELS) used a data extraction sheet to capture data on interventions and their effect on readmission. A second reviewer (RK) verified data extraction in a random sample of four systematic reviews. Results were not meta-analysed. Odds and risk ratios are presented where available. Results: Three interventions significantly reduce readmission in elderly populations after hip fracture: personalised discharge planning, self-care and regional anaesthesia. Three interventions are not conclusively supported by evidence: oral nutritional supplementation, integration of care, and case management. Two interventions do not affect readmission after hip fracture: enhanced recovery pathways and comprehensive geriatric assessment. Conclusions: Three interventions are most effective at reducing readmissions in older people: discharge planning, self-care, and regional anaesthesia. Further work is needed to optimise interventions and ensure the most at-risk populations benefit from them, and to complete development work on interventions (e.g. interventions to reduce loneliness) and intervention components (e.g. adapting self-care interventions for dementia patients) which have not yet been fully tested.
SMART (Stochastic Model Acquisition with ReinforcemenT) learning agents: A preliminary report
We present a framework for building agents that learn using SMART, a system that combines stochastic model acquisition with reinforcement learning to enable an agent to model its environment through experience and subsequently form action selection policies using the acquired model. We extend an existing algorithm for the automatic creation of stochastic STRIPS operators [9] as a preliminary method of environment modelling. We then define the process of generating future states using these operators and an initial state, and finally show the process by which the agent can use the generated states to form a policy with a standard reinforcement learning algorithm. The potential of SMART is exemplified using the well-known predator-prey scenario. Results of applying SMART to this environment and directions for future work are discussed.
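To illustrate the kind of representation involved, the following hedged Python sketch defines a stochastic STRIPS-style operator and samples successor states from it. The field names and the example operator are illustrative assumptions, not the operators actually learned by SMART.

```python
import random
from dataclasses import dataclass, field

@dataclass
class StochasticOperator:
    """A stochastic STRIPS-style operator: preconditions plus a distribution
    over (add-list, delete-list) outcomes. Field names are illustrative."""
    name: str
    preconditions: frozenset
    outcomes: list = field(default_factory=list)   # (probability, add_set, delete_set)

    def applicable(self, state: frozenset) -> bool:
        return self.preconditions <= state

    def sample_successor(self, state: frozenset) -> frozenset:
        """Sample one successor state by drawing an outcome."""
        r, acc = random.random(), 0.0
        for p, add, delete in self.outcomes:
            acc += p
            if r <= acc:
                return (state - delete) | add
        return state   # numerical slack: fall back to no change

# Hypothetical example: a 'move' operator that succeeds with probability 0.8.
move = StochasticOperator(
    name="move(a,b)",
    preconditions=frozenset({"at(a)"}),
    outcomes=[(0.8, frozenset({"at(b)"}), frozenset({"at(a)"})),
              (0.2, frozenset(), frozenset())],
)
state = frozenset({"at(a)"})
print(move.sample_successor(state) if move.applicable(state) else state)
```

Rolling such operators forward from an initial state yields the generated future states on which a standard reinforcement learning algorithm can then be run.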
Bayesian Nonparametric Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) is the task of learning the reward function of a Markov Decision Process (MDP) given the transition function and a set of observed demonstrations in the form of state-action pairs. Current IRL algorithms attempt to find a single reward function which explains the entire observation set. In practice, this leads to a computationally costly search over a large (typically infinite) space of complex reward functions. This paper proposes the notion that if the observations can be partitioned into smaller groups, a class of much simpler reward functions can be used to explain each group. The proposed method uses a Bayesian nonparametric mixture model to automatically partition the data and find a set of simple reward functions corresponding to each partition. The simple rewards are interpreted intuitively as subgoals, which can be used to predict actions or analyze which states are important to the demonstrator. Experimental results are given for simple examples showing comparable performance to other IRL algorithms in nominal situations. Moreover, the proposed method handles cyclic tasks (where the agent begins and ends in the same state) that would break existing algorithms without modification. Finally, the new algorithm has a fundamentally different structure than previous methods, making it more computationally efficient in a real-world learning scenario where the state space is large but the demonstration set is small.
Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters
The multi-armed bandit problem is a classical optimization problem where an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information. Dynamically changing (non-stationary) bandit problems are particularly challenging because each change of the reward distributions may progressively degrade the performance of any fixed strategy. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel solution scheme for bandit problems with non-stationary normally distributed rewards. The scheme is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling Kalman filters, and on random sampling from these posteriors. Furthermore, it is able to track the better actions, thus supporting non-stationary bandit problems. Extensive experiments demonstrate that our scheme outperforms recently proposed bandit playing algorithms, not only in non-stationary environments but also in stationary environments. Furthermore, our scheme is robust to inexact parameter settings. We thus believe that our methodology opens avenues for obtaining improved novel solutions.
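A hedged sketch of the core mechanism, one scalar Kalman filter per arm with random-walk process noise and Thompson-style sampling from the posteriors, is given below. All parameter values and the `pull` environment hook are illustrative assumptions, not the paper's exact scheme.

```python
import random

class KalmanArm:
    """Posterior over one arm's (drifting) mean reward.

    A hedged sketch: a scalar Kalman filter whose state is the arm's mean
    reward, with random-walk process noise to track non-stationarity.
    """
    def __init__(self, obs_noise=1.0, process_noise=0.01):
        self.mean, self.var = 0.0, 100.0          # broad prior
        self.obs_noise, self.process_noise = obs_noise, process_noise

    def predict(self):
        self.var += self.process_noise            # drift inflates uncertainty

    def update(self, reward):
        gain = self.var / (self.var + self.obs_noise)
        self.mean += gain * (reward - self.mean)
        self.var *= (1.0 - gain)

    def sample(self):
        return random.gauss(self.mean, self.var ** 0.5)

def play(arms, pull, horizon=1000):
    """Each step: sample from every sibling filter's posterior, pull the arm
    with the largest sample, and update only that arm's filter.
    `pull(i)` is an assumed environment hook returning a reward."""
    total = 0.0
    for _ in range(horizon):
        for arm in arms:
            arm.predict()
        i = max(range(len(arms)), key=lambda k: arms[k].sample())
        r = pull(i)
        arms[i].update(r)
        total += r
    return total
```

Because the process noise keeps every arm's posterior from collapsing, the scheme continues to explore and can re-identify the best arm after the reward distributions drift.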