Planning to Learn: A Novel Algorithm for Active Learning during Model-Based Planning
Active Inference is a recent framework for modeling planning under
uncertainty. Empirical and theoretical work has now begun to evaluate the
strengths and weaknesses of this approach and how it might be improved. A
recent extension - the sophisticated inference (SI) algorithm - improves
performance on multi-step planning problems through recursive decision tree
search. However, little work to date has been done to compare SI to other
established planning algorithms. SI was also developed with a focus on
inference as opposed to learning. The present paper has two aims. First, we
compare performance of SI to Bayesian reinforcement learning (RL) schemes
designed to solve similar problems. Second, we present an extension of SI -
sophisticated learning (SL) - that more fully incorporates active learning
during planning. SL maintains beliefs about how model parameters would change
under the future observations expected under each policy. This allows a form of
counterfactual retrospective inference in which the agent considers what could
be learned from current or past observations given different future
observations. To accomplish these aims, we make use of a novel, biologically
inspired environment designed to highlight the problem structure for which SL
offers a unique solution. Here, an agent must continually search for available
(but changing) resources in the presence of competing affordances for
information gain. Our simulations show that SL outperforms all other algorithms
in this context - most notably, Bayes-adaptive RL and upper confidence bound
algorithms, which aim to solve multi-step planning problems using similar
principles (i.e., directed exploration and counterfactual reasoning). These
results provide further support for the utility of Active Inference in solving
this class of biologically relevant problems and offer additional tools for testing
hypotheses about human cognition.
Comment: 31 pages, 5 figures
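To make the recursive structure of SI concrete, the following is a minimal sketch of sophisticated-inference-style decision tree search over a discrete generative model. The decomposition of expected free energy into risk and ambiguity is standard in this literature, but the function names, the pruning threshold, and the simplified interface are illustrative assumptions, not the paper's implementation. The SL extension would additionally roll Dirichlet counts over the likelihood model forward under each simulated observation, which is the counterfactual learning term described above.

```python
import numpy as np

def sophisticated_search(belief, A, B, log_prefs, depth):
    """Sophisticated-inference-style recursive tree search (illustrative sketch).

    belief    : posterior over hidden states, shape (n_states,)
    A         : observation likelihood P(o|s), shape (n_obs, n_states)
    B         : transition model per action, shape (n_actions, n_states, n_states)
    log_prefs : log-preferences over observations, shape (n_obs,)
    Returns (best action, its expected free energy G).
    """
    if depth == 0:
        return None, 0.0
    n_actions, n_obs = B.shape[0], A.shape[0]
    G = np.zeros(n_actions)
    for a in range(n_actions):
        q_s = B[a] @ belief                       # predicted states under action a
        q_o = A @ q_s                             # predicted observations
        risk = q_o @ (np.log(q_o + 1e-16) - log_prefs)
        ambiguity = -q_s @ (A * np.log(A + 1e-16)).sum(axis=0)
        G[a] = risk + ambiguity
        for o in range(n_obs):                    # recurse over plausible outcomes
            if q_o[o] < 1e-3:                     # prune unlikely branches
                continue
            post = A[o] * q_s
            post = post / post.sum()              # Bayesian belief update given o
            _, G_next = sophisticated_search(post, A, B, log_prefs, depth - 1)
            G[a] += q_o[o] * G_next
    best = int(np.argmin(G))
    return best, G[best]
```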
Mean-field games of speedy information access with observation costs
We investigate a mean-field game (MFG) in which agents can exercise control
actions that affect their speed of access to information. The agents can
dynamically decide to receive observations with less delay by paying higher
observation costs. Agents seek to exploit their active information gathering by
making further decisions to influence their state dynamics to maximize rewards.
In the mean field equilibrium, each generic agent solves individually a
partially observed Markov decision problem in which the way partial
observations are obtained is itself subject to dynamic control by the
agent. Based on a finite characterisation of the agents' belief states, we
show how the mean field game with controlled costly information access can be
formulated as an equivalent standard mean field game on a suitably augmented
but finite state space. We prove that, with sufficient entropy regularisation, a
fixed-point iteration converges to the unique MFG equilibrium and yields an
approximate ε-Nash equilibrium for a large but finite population size.
We illustrate our MFG with an example from epidemiology, where agents can choose
medical tests that return results at different speeds and costs.
Comment: 33 pages, 4 figures
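As a concrete illustration of the fixed-point scheme, below is a minimal sketch of entropy-regularised fixed-point iteration for a stationary mean-field game on a finite state space. The interface (rewards as a function of the population distribution), the temperature, and the iteration limits are assumptions made for illustration; the paper's augmented belief-state construction and convergence proof are not reproduced here.

```python
import numpy as np

def mfg_fixed_point(P, reward, temp=0.5, gamma=0.95, iters=200, tol=1e-8):
    """Entropy-regularised fixed-point iteration for a finite-state MFG (sketch).

    P      : transition kernels per action, shape (n_actions, n_states, n_states)
    reward : function mu -> array (n_states, n_actions), rewards given the
             current population distribution mu (assumed interface)
    """
    _, n_s, _ = P.shape
    mu = np.full(n_s, 1.0 / n_s)                  # initial mean field
    for _ in range(iters):
        # 1) Best response: soft value iteration against the frozen mean field.
        V = np.zeros(n_s)
        for _ in range(500):
            Q = reward(mu) + gamma * np.einsum('ast,t->sa', P, V)
            V_new = temp * np.logaddexp.reduce(Q / temp, axis=1)   # soft backup
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        pi = np.exp((Q - V[:, None]) / temp)      # entropy-regularised policy
        pi /= pi.sum(axis=1, keepdims=True)
        # 2) Mean-field update: push mu forward one step under pi.
        mu_next = np.einsum('s,sa,ast->t', mu, pi, P)
        if np.max(np.abs(mu_next - mu)) < tol:    # fixed point reached
            break
        mu = mu_next
    return pi, mu
```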
CAR-DESPOT: Causally-Informed Online POMDP Planning for Robots in Confounded Environments
Robots operating in real-world environments must reason about possible
outcomes of stochastic actions and make decisions based on partial observations
of the true world state. A major challenge for making accurate and robust
action predictions is the problem of confounding, which if left untreated can
lead to prediction errors. The partially observable Markov decision process
(POMDP) is a widely-used framework to model these stochastic and
partially-observable decision-making problems. However, due to a lack of
explicit causal semantics, POMDP planning methods are prone to confounding bias
and thus in the presence of unobserved confounders may produce underperforming
policies. This paper presents a novel causally-informed extension of "anytime
regularized determinized sparse partially observable tree" (AR-DESPOT), a
modern anytime online POMDP planner, using causal modelling and inference to
eliminate errors caused by unmeasured confounder variables. We further propose
a method to learn offline the partial parameterisation of the causal model for
planning, from ground truth model data. We evaluate our methods on a toy
problem with an unobserved confounder and show that the learned causal model is
highly accurate, while our planning method is more robust to confounding and
produces overall higher-performing policies than AR-DESPOT.
Comment: 8 pages, 3 figures, submitted to the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
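To see why confounding matters for the models a planner relies on, consider a toy structural causal model in which a hidden variable drives both the behaviour policy's action and the outcome; all numbers below are hypothetical. The naive observational estimate is biased, while a back-door adjustment over the confounder, computable offline when ground-truth model data are available as the paper assumes, recovers the interventional quantity a planner should use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM (hypothetical numbers): a hidden confounder U drives both the
# behaviour policy's action A and the outcome S.  Ground truth: A has no
# effect on S, so all observed correlation is due to U.
n = 100_000
U = rng.binomial(1, 0.5, n)                        # unobserved confounder
A = rng.binomial(1, np.where(U == 1, 0.9, 0.1))    # confounded action choice
S = rng.binomial(1, np.where(U == 1, 0.8, 0.2))    # outcome driven by U only

# Naive observational estimate P(S=1 | A=1) is badly biased (~0.74):
naive = S[A == 1].mean()

# Back-door adjustment, P(S=1 | do(A=1)) = sum_u P(S=1 | A=1, U=u) P(U=u),
# recovers the true interventional effect (~0.50):
adjusted = sum(S[(A == 1) & (U == u)].mean() * (U == u).mean() for u in (0, 1))

print(f"naive:    {naive:.3f}")
print(f"adjusted: {adjusted:.3f}")
```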
Reinforcement learning in large state action spaces
Reinforcement learning (RL) is a promising framework for training intelligent agents that learn to optimize long-term utility by directly interacting with the environment. Creating RL methods that scale to large state-action spaces is a critical problem for the real-world deployment of RL systems. However, several challenges limit the applicability of RL to large-scale settings. These include difficulties with exploration, low sample efficiency, computational intractability, task constraints such as decentralization, and a lack of guarantees about important properties such as performance, generalization, and robustness in potentially unseen scenarios.
This thesis is motivated by bridging the aforementioned gaps. We propose several principled algorithms and frameworks for studying and addressing the above challenges in RL. The proposed methods cover a wide range of RL settings (single- and multi-agent systems (MAS) with all the variations in the latter, prediction and control, model-based and model-free methods, value-based and policy-based methods). In this work we present the first results on several different problems, e.g., tensorization of the Bellman equation, which allows exponential sample-efficiency gains (Chapter 4; see the generic sketch below); provable suboptimality arising from structural constraints in MAS (Chapter 3); combinatorial generalization results in cooperative MAS (Chapter 5); generalization results on observation shifts (Chapter 7); and learning deterministic policies in a probabilistic RL framework (Chapter 6). Our algorithms exhibit provably enhanced performance and sample efficiency along with better scalability. Additionally, we shed light on generalization aspects of the agents under different frameworks. These properties have been driven by the use of several advanced tools (e.g., statistical machine learning, state abstraction, variational inference, tensor theory).
In summary, the contributions in this thesis significantly advance progress towards making RL agents ready for large-scale, real-world applications.
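For reference, here is the standard tabular Bellman optimality backup that value-based results such as the tensorised Bellman equation build on; this generic value-iteration sketch is not the thesis's tensorised algorithm.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration via the Bellman optimality backup (generic sketch).

    P : transition tensor, shape (n_states, n_actions, n_states)
    R : expected rewards,  shape (n_states, n_actions)
    """
    n_s, _, _ = P.shape
    V = np.zeros(n_s)
    while True:
        Q = R + gamma * (P @ V)        # Q[s,a] = R[s,a] + gamma * sum_t P[s,a,t] V[t]
        V_new = Q.max(axis=1)          # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_new
```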
Investigation of risk-aware MDP and POMDP contingency management autonomy for UAS
Unmanned aircraft systems (UAS) are being increasingly adopted for various
applications. The risk UAS pose to people and property must be kept to
acceptable levels. This paper proposes risk-aware contingency management
autonomy to prevent an accident in the event of component malfunction,
specifically propulsion unit failure and/or battery degradation. The proposed
autonomy is modeled as a Markov Decision Process (MDP) whose solution is a
contingency management policy that appropriately selects among emergency landing,
flight termination, and continuation of the planned flight. Motivated by the
potential for errors in fault/failure indicators, partial observability of the
MDP state space is investigated. The performance of optimal policies is
analyzed over varying observability conditions in a high-fidelity simulator.
Results indicate that both partially observable MDP (POMDP) and maximum a
posteriori MDP policies performed similarly over different state observability
criteria, given the nearly deterministic state transition model.
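A minimal illustration of how belief-based and maximum a posteriori policies can differ under partial observability; the states, actions, and utilities below are hypothetical, and a QMDP-style belief-averaged rule stands in for a full POMDP solution.

```python
import numpy as np

# Hypothetical contingency-management setup (illustrative numbers only).
ACTIONS = ["continue_flight", "emergency_landing", "flight_termination"]
STATES = ["nominal", "degraded_battery", "failed_motor"]

# Assumed expected utility of each action in each true health state.
U = np.array([
    #  continue  land   terminate
    [  10.0,     2.0,   -5.0],   # nominal
    [  -8.0,     6.0,    1.0],   # degraded_battery
    [ -50.0,     4.0,    3.0],   # failed_motor
])

def map_mdp_action(belief):
    """Apply the MDP policy at the maximum a posteriori state."""
    return ACTIONS[int(np.argmax(U[np.argmax(belief)]))]

def belief_action(belief):
    """QMDP-style choice: maximise utility averaged over the belief."""
    return ACTIONS[int(np.argmax(belief @ U))]

belief = np.array([0.45, 0.40, 0.15])   # noisy fault indicators
print(map_mdp_action(belief))   # acts as if 'nominal': continue_flight
print(belief_action(belief))    # hedges against failure: emergency_landing
```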
Optimal Status Updates for Minimizing Age of Correlated Information in IoT Networks with Energy Harvesting Sensors
Many real-time applications of the Internet of Things (IoT) need to deal with
correlated information generated by multiple sensors. Designing efficient
status update strategies that minimize the Age of Correlated Information (AoCI)
is therefore a key problem. In this paper, we consider an IoT network consisting of
sensors equipped with the energy harvesting (EH) capability. We optimize the
average AoCI at the data fusion center (DFC) by appropriately managing the
energy harvested by sensors, whose true battery states are unobservable during
the decision-making process. Particularly, we first formulate the dynamic
status update procedure as a partially observable Markov decision process
(POMDP), where the environmental dynamics are unknown to the DFC. In order to
address the challenges arising from the causality of energy usage, unknown
environmental dynamics, unobservability of the sensors' true battery states, and
large-scale discrete action space, we devise a deep reinforcement learning
(DRL)-based dynamic status update algorithm. The algorithm leverages the
advantages of the soft actor-critic and long short-term memory techniques.
Meanwhile, it incorporates our proposed action decomposition and mapping
mechanism. Extensive simulations are conducted to validate the effectiveness of
our proposed algorithm by comparing it with available DRL algorithms for
POMDPs.
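To fix ideas, here is a sketch of the per-slot age and battery recursion underlying such formulations; the group-max definition of AoCI, the unit transmission cost, and all numbers are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np

def step_aoci(ages, scheduled, batteries, harvest, cost=1, b_max=5):
    """One time slot of an AoI-style recursion with EH sensors (sketch).

    ages      : age of each sensor's latest update at the fusion centre
    scheduled : boolean mask of sensors asked to transmit this slot
    batteries : true (possibly unobserved) battery levels
    harvest   : energy units harvested by each sensor this slot
    """
    ok = scheduled & (batteries >= cost)        # transmission needs energy
    ages = np.where(ok, 1, ages + 1)            # age resets on a fresh update
    batteries = np.minimum(batteries - cost * ok + harvest, b_max)
    # With correlated sources, a datum is only as fresh as its stalest
    # contributor, so take the max age over the group as the AoCI (assumed).
    return ages, batteries, ages.max()

ages, batteries, aoci = step_aoci(
    ages=np.array([3, 1, 7]),
    scheduled=np.array([True, True, False]),
    batteries=np.array([2, 0, 4]),
    harvest=np.array([1, 1, 0]),
)
print(ages, batteries, aoci)    # sensor 1 lacked energy, so its age grows
```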
A Review of Symbolic, Subsymbolic and Hybrid Methods for Sequential Decision Making
The field of Sequential Decision Making (SDM) provides tools for solving
Sequential Decision Processes (SDPs), where an agent must make a series of
decisions in order to complete a task or achieve a goal. Historically, two
competing SDM paradigms have vied for supremacy. Automated Planning (AP)
proposes to solve SDPs by performing a reasoning process over a model of the
world, often represented symbolically. Conversely, Reinforcement Learning (RL)
proposes to learn the solution of the SDP from data, without a world model, and
represent the learned knowledge subsymbolically. In the spirit of
reconciliation, we provide a review of symbolic, subsymbolic and hybrid methods
for SDM. We cover both methods for solving SDPs (e.g., AP, RL and techniques
that learn to plan) and for learning aspects of their structure (e.g., world
models, state invariants and landmarks). To the best of our knowledge, no other
review in the field provides the same scope. As an additional contribution, we
discuss what properties an ideal method for SDM should exhibit and argue that
neurosymbolic AI is the current approach which most closely resembles this
ideal method. Finally, we outline several proposals to advance the field of SDM
via the integration of symbolic and subsymbolic AI.
Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
In real-world reinforcement learning (RL) systems, various forms of impaired
observability can complicate matters. These situations arise when an agent is
unable to observe the most recent state of the system due to latency or lossy
channels, yet the agent must still make real-time decisions. This paper
introduces a theoretical investigation into efficient RL in control systems
where agents must act with delayed and missing state observations. We establish
near-optimal regret bounds for RL in both the delayed and missing observation settings.
Despite impaired observability posing significant challenges to the policy
class and planning, our results demonstrate that learning remains efficient,
with the regret bound optimally depending on the state-action size of the
original system. Additionally, we provide a characterization of the performance
of the optimal policy under impaired observability, comparing it to the optimal
value obtained with full observability.
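As a concrete example of acting under a known constant delay, a standard reduction conditions the policy on the last observed state together with the actions taken since; the sketch below is a generic illustration with a random placeholder policy, not the paper's algorithm.

```python
from collections import deque
import random

class DelayedObservationAgent:
    """Act under a fixed observation delay d (illustrative sketch).

    With a constant, known delay, the last observed state plus the d
    actions taken since form a sufficient statistic for the current state
    distribution, so the policy conditions on that pair.
    """
    def __init__(self, policy, delay):
        self.policy = policy                  # maps (state, action_window) -> action
        self.window = deque(maxlen=delay)     # actions taken since last observation

    def act(self, delayed_state):
        action = self.policy(delayed_state, tuple(self.window))
        self.window.append(action)
        return action

# Usage with a placeholder random policy (assumed two actions):
agent = DelayedObservationAgent(lambda s, w: random.choice([0, 1]), delay=3)
print([agent.act(delayed_state=0) for _ in range(5)])
```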
Scaling energy management in buildings with artificial intelligence
The abstract is available in the attached document.