8,365 research outputs found
Learning non-Markovian Decision-Making from State-only Sequences
Conventional imitation learning assumes access to the actions of
demonstrators, but these motor signals are often non-observable in naturalistic
settings. Additionally, sequential decision-making behaviors in these settings
can deviate from the assumptions of a standard Markov Decision Process (MDP).
To address these challenges, we explore deep generative modeling of state-only
sequences with non-Markov Decision Process (nMDP), where the policy is an
energy-based prior in the latent space of the state transition generator. We
develop maximum likelihood estimation to achieve model-based imitation, which
involves short-run MCMC sampling from the prior and importance sampling for the
posterior. The learned model enables \textit{decision-making as inference}:
model-free policy execution is equivalent to prior sampling, model-based
planning is posterior sampling initialized from the policy. We demonstrate the
efficacy of the proposed method in a prototypical path planning task with
non-Markovian constraints and show that the learned model exhibits strong
performances in challenging domains from the MuJoCo suite
On information captured by neural networks: connections with memorization and generalization
Despite the popularity and success of deep learning, there is limited
understanding of when, how, and why neural networks generalize to unseen
examples. Since learning can be seen as extracting information from data, we
formally study information captured by neural networks during training.
Specifically, we start with viewing learning in presence of noisy labels from
an information-theoretic perspective and derive a learning algorithm that
limits label noise information in weights. We then define a notion of unique
information that an individual sample provides to the training of a deep
network, shedding some light on the behavior of neural networks on examples
that are atypical, ambiguous, or belong to underrepresented subpopulations. We
relate example informativeness to generalization by deriving nonvacuous
generalization gap bounds. Finally, by studying knowledge distillation, we
highlight the important role of data and label complexity in generalization.
Overall, our findings contribute to a deeper understanding of the mechanisms
underlying neural network generalization.Comment: PhD thesi
Policy Space Diversity for Non-Transitive Games
Policy-Space Response Oracles (PSRO) is an influential algorithm framework
for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games.
Many previous studies have been trying to promote policy diversity in PSRO. A
major weakness in existing diversity metrics is that a more diverse (according
to their diversity metrics) population does not necessarily mean (as we proved
in the paper) a better approximation to a NE. To alleviate this problem, we
propose a new diversity metric, the improvement of which guarantees a better
approximation to a NE. Meanwhile, we develop a practical and well-justified
method to optimize our diversity metric using only state-action samples. By
incorporating our diversity regularization into the best response solving in
PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We
present the convergence property of PSD-PSRO. Empirically, extensive
experiments on various games demonstrate that PSD-PSRO is more effective in
producing significantly less exploitable policies than state-of-the-art PSRO
variants
AI-Generated Incentive Mechanism and Full-Duplex Semantic Communications for Information Sharing
The next generation of Internet services, such as Metaverse, rely on mixed
reality (MR) technology to provide immersive user experiences. However, the
limited computation power of MR headset-mounted devices (HMDs) hinders the
deployment of such services. Therefore, we propose an efficient information
sharing scheme based on full-duplex device-to-device (D2D) semantic
communications to address this issue. Our approach enables users to avoid heavy
and repetitive computational tasks, such as artificial intelligence-generated
content (AIGC) in the view images of all MR users. Specifically, a user can
transmit the generated content and semantic information extracted from their
view image to nearby users, who can then use this information to obtain the
spatial matching of computation results under their view images. We analyze the
performance of full-duplex D2D communications, including the achievable rate
and bit error probability, by using generalized small-scale fading models. To
facilitate semantic information sharing among users, we design a contract
theoretic AI-generated incentive mechanism. The proposed diffusion model
generates the optimal contract design, outperforming two deep reinforcement
learning algorithms, i.e., proximal policy optimization and soft actor-critic
algorithms. Our numerical analysis experiment proves the effectiveness of our
proposed methods. The code for this paper is available at
https://github.com/HongyangDu/SemSharingComment: Accepted by IEEE JSA
Active Coverage for PAC Reinforcement Learning
Collecting and leveraging data with good coverage properties plays a crucial
role in different aspects of reinforcement learning (RL), including reward-free
exploration and offline learning. However, the notion of "good coverage" really
depends on the application at hand, as data suitable for one context may not be
so for another. In this paper, we formalize the problem of active coverage in
episodic Markov decision processes (MDPs), where the goal is to interact with
the environment so as to fulfill given sampling requirements. This framework is
sufficiently flexible to specify any desired coverage property, making it
applicable to any problem that involves online exploration. Our main
contribution is an instance-dependent lower bound on the sample complexity of
active coverage and a simple game-theoretic algorithm, CovGame, that nearly
matches it. We then show that CovGame can be used as a building block to solve
different PAC RL tasks. In particular, we obtain a simple algorithm for PAC
reward-free exploration with an instance-dependent sample complexity that, in
certain MDPs which are "easy to explore", is lower than the minimax one. By
further coupling this exploration algorithm with a new technique to do implicit
eliminations in policy space, we obtain a computationally-efficient algorithm
for best-policy identification whose instance-dependent sample complexity
scales with gaps between policy values.Comment: Accepted at COLT 202
CAR-DESPOT: Causally-Informed Online POMDP Planning for Robots in Confounded Environments
Robots operating in real-world environments must reason about possible
outcomes of stochastic actions and make decisions based on partial observations
of the true world state. A major challenge for making accurate and robust
action predictions is the problem of confounding, which if left untreated can
lead to prediction errors. The partially observable Markov decision process
(POMDP) is a widely-used framework to model these stochastic and
partially-observable decision-making problems. However, due to a lack of
explicit causal semantics, POMDP planning methods are prone to confounding bias
and thus in the presence of unobserved confounders may produce underperforming
policies. This paper presents a novel causally-informed extension of "anytime
regularized determinized sparse partially observable tree" (AR-DESPOT), a
modern anytime online POMDP planner, using causal modelling and inference to
eliminate errors caused by unmeasured confounder variables. We further propose
a method to learn offline the partial parameterisation of the causal model for
planning, from ground truth model data. We evaluate our methods on a toy
problem with an unobserved confounder and show that the learned causal model is
highly accurate, while our planning method is more robust to confounding and
produces overall higher performing policies than AR-DESPOT.Comment: 8 pages, 3 figures, submitted to 2023 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS
Segment Anything Model (SAM) for Radiation Oncology
In this study, we evaluate the performance of the Segment Anything Model
(SAM) model in clinical radiotherapy. We collected real clinical cases from
four regions at the Mayo Clinic: prostate, lung, gastrointestinal, and head \&
neck, which are typical treatment sites in radiation oncology. For each case,
we selected the OARs of concern in radiotherapy planning and compared the Dice
and Jaccard outcomes between clinical manual delineation, automatic
segmentation using SAM's "segment anything" mode, and automatic segmentation
using SAM with box prompt. Our results indicate that SAM performs better in
automatic segmentation for the prostate and lung regions, while its performance
in the gastrointestinal and head \& neck regions was relatively inferior. When
considering the size of the organ and the clarity of its boundary, SAM displays
better performance for larger organs with clear boundaries, such as the lung
and liver, and worse for smaller organs with unclear boundaries, like the
parotid and cochlea. These findings align with the generally accepted
variations in difficulty level associated with manual delineation of different
organs at different sites in clinical radiotherapy. Given that SAM, a single
trained model, could handle the delineation of OARs in four regions, these
results also demonstrate SAM's robust generalization capabilities in automatic
segmentation for radiotherapy, i.e., achieving delineation of different
radiotherapy OARs using a generic automatic segmentation model. SAM's
generalization capabilities across different regions make it technically
feasible to develop a generic model for automatic segmentation in radiotherapy
Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting
Most offline reinforcement learning (RL) algorithms return a target policy
maximizing a trade-off between (1) the expected performance gain over the
behavior policy that collected the dataset, and (2) the risk stemming from the
out-of-distribution-ness of the induced state-action occupancy. It follows that
the performance of the target policy is strongly related to the performance of
the behavior policy and, thus, the trajectory return distribution of the
dataset. We show that in mixed datasets consisting of mostly low-return
trajectories and minor high-return trajectories, state-of-the-art offline RL
algorithms are overly restrained by low-return trajectories and fail to exploit
high-performing trajectories to the fullest. To overcome this issue, we show
that, in deterministic MDPs with stochastic initial states, the dataset
sampling can be re-weighted to induce an artificial dataset whose behavior
policy has a higher return. This re-weighted sampling strategy may be combined
with any offline RL algorithm. We further analyze that the opportunity for
performance improvement over the behavior policy correlates with the
positive-sided variance of the returns of the trajectories in the dataset. We
empirically show that while CQL, IQL, and TD3+BC achieve only a part of this
potential policy improvement, these same algorithms combined with our
reweighted sampling strategy fully exploit the dataset. Furthermore, we
empirically demonstrate that, despite its theoretical limitation, the approach
may still be efficient in stochastic environments. The code is available at
https://github.com/Improbable-AI/harness-offline-rl
Emergence of Adaptive Circadian Rhythms in Deep Reinforcement Learning
Adapting to regularities of the environment is critical for biological
organisms to anticipate events and plan. A prominent example is the circadian
rhythm corresponding to the internalization by organisms of the -hour
period of the Earth's rotation. In this work, we study the emergence of
circadian-like rhythms in deep reinforcement learning agents. In particular, we
deployed agents in an environment with a reliable periodic variation while
solving a foraging task. We systematically characterize the agent's behavior
during learning and demonstrate the emergence of a rhythm that is endogenous
and entrainable. Interestingly, the internal rhythm adapts to shifts in the
phase of the environmental signal without any re-training. Furthermore, we show
via bifurcation and phase response curve analyses how artificial neurons
develop dynamics to support the internalization of the environmental rhythm.
From a dynamical systems view, we demonstrate that the adaptation proceeds by
the emergence of a stable periodic orbit in the neuron dynamics with a phase
response that allows an optimal phase synchronisation between the agent's
dynamics and the environmental rhythm.Comment: ICML 202
Reinforcement learning in large state action spaces
Reinforcement learning (RL) is a promising framework for training intelligent agents which learn to optimize long term utility by directly interacting with the environment. Creating RL methods which scale to large state-action spaces is a critical problem towards ensuring real world deployment of RL systems. However, several challenges limit the applicability of RL to large scale settings. These include difficulties with exploration, low sample efficiency, computational intractability, task constraints like decentralization and lack of guarantees about important properties like performance, generalization and robustness in potentially unseen scenarios.
This thesis is motivated towards bridging the aforementioned gap. We propose several principled algorithms and frameworks for studying and addressing the above challenges RL. The proposed methods cover a wide range of RL settings (single and multi-agent systems (MAS) with all the variations in the latter, prediction and control, model-based and model-free methods, value-based and policy-based methods). In this work we propose the first results on several different problems: e.g. tensorization of the Bellman equation which allows exponential sample efficiency gains (Chapter 4), provable suboptimality arising from structural constraints in MAS(Chapter 3), combinatorial generalization results in cooperative MAS(Chapter 5), generalization results on observation shifts(Chapter 7), learning deterministic policies in a probabilistic RL framework(Chapter 6). Our algorithms exhibit provably enhanced performance and sample efficiency along with better scalability. Additionally, we also shed light on generalization aspects of the agents under different frameworks. These properties have been been driven by the use of several advanced tools (e.g. statistical machine learning, state abstraction, variational inference, tensor theory).
In summary, the contributions in this thesis significantly advance progress towards making RL agents ready for large scale, real world applications
- …