Abstractions in Reasoning for Long-Term Autonomy
The path to building adaptive, robust, intelligent agents has led researchers to develop a suite of powerful models and algorithms for agents with a single objective. However, in recent years, attempts to use this monolithic approach to solve an ever-expanding set of complex real-world problems, which increasingly include long-term autonomous deployments, have illuminated challenges in its ability to scale. Consequently, a fragmented collection of hierarchical and multi-objective models was developed. This trend continues into the algorithms as well, as each approximates an optimal solution in a different manner for scalability. These models and algorithms represent an attempt to solve pieces of an overarching problem: how can an agent explicitly model and integrate the necessary aspects of reasoning required to achieve long-term autonomy?
This thesis presents a general hierarchical and multi-objective model called a policy network that unifies prior fragmented solutions into a single graphical decision-making structure. Policy networks are broadly useful to solve numerous real-world problems. This thesis focuses on autonomous vehicle (AV) problems: (1) route-planning with multiple objectives; (2) semi-autonomy with proactive transfer of control; and (3) intersection decision-making for reasoning online about any number of other vehicles and pedestrians. Formal models are presented for each of the distinct problems. Solutions are evaluated using real-world map data in simulation and demonstrated on a fully operational AV prototype driving on real public roads. Policy networks serve as a shared underlying framework for all three, enabling their seamless integration as parts of an overall solution for rich, real-world, scalable decision-making in agents with long-term autonomy.
Problems with Using Evolutionary Theory in Philosophy
Does science move toward truths? Are present scientific theories (approximately) true? Should we invoke truths to explain the success of science? Do our cognitive faculties track truths? Some philosophers say yes, while others say no, to these questions. Interestingly, both groups use the same scientific theory, viz., evolutionary theory, to defend their positions. I argue that it begs the question for the former group to do so because their positive answers imply that evolutionary theory is warranted, whereas it is self-defeating for the latter group to do so because their negative answers imply that evolutionary theory is unwarranted.
Active teacher selection for reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) enables machine learning
systems to learn objectives from human feedback. A core limitation of these
systems is their assumption that all feedback comes from a single human
teacher, despite querying a range of distinct teachers. We propose the Hidden
Utility Bandit (HUB) framework to model differences in teacher rationality,
expertise, and costliness, formalizing the problem of learning from multiple
teachers. We develop a variety of solution algorithms and apply them to two
real-world domains: paper recommendation systems and COVID-19 vaccine testing.
We find that the Active Teacher Selection (ATS) algorithm outperforms baseline
algorithms by actively selecting when and which teacher to query. The HUB
framework and ATS algorithm demonstrate the importance of leveraging
differences between teachers to learn accurate reward models, facilitating
future research on active teacher selection for robust reward modeling.
Constrained Hierarchical Monte Carlo Belief-State Planning
Optimal plans in Constrained Partially Observable Markov Decision Processes
(CPOMDPs) maximize reward objectives while satisfying hard cost constraints,
generalizing safe planning under state and transition uncertainty.
Unfortunately, online CPOMDP planning is extremely difficult in large or
continuous problem domains. In many large robotic domains, hierarchical
decomposition can simplify planning by using tools for low-level control given
high-level action primitives (options). We introduce Constrained Options Belief
Tree Search (COBeTS) to leverage this hierarchy and scale online search-based
CPOMDP planning to large robotic problems. We show that if primitive option
controllers are defined to satisfy assigned constraint budgets, then COBeTS
will satisfy constraints anytime. Otherwise, COBeTS will guide the search
towards a safe sequence of option primitives, and hierarchical monitoring can
be used to achieve runtime safety. We demonstrate COBeTS in several
safety-critical, constrained partially observable robotic domains, showing that
it can plan successfully in continuous CPOMDPs while non-hierarchical baselines
cannot.
Comment: Under review for the 2024 IEEE International Conference on Robotics
and Automation (ICRA)
Decision Making in Non-Stationary Environments with Policy-Augmented Search
Sequential decision-making under uncertainty is present in many important
problems. Two popular approaches for tackling such problems are reinforcement
learning and online search (e.g., Monte Carlo tree search). While the former
learns a policy by interacting with the environment (typically done before
execution), the latter uses a generative model of the environment to sample
promising action trajectories at decision time. Decision-making is particularly
challenging in non-stationary environments, where the environment in which an
agent operates can change over time. Both approaches have shortcomings in such
settings -- on the one hand, policies learned before execution become stale
when the environment changes and relearning takes both time and computational
effort. Online search, on the other hand, can return sub-optimal actions when
there are limitations on allowed runtime. In this paper, we introduce
Policy-Augmented Monte Carlo tree search (PA-MCTS), which combines
action-value estimates from an out-of-date policy with an online search using
an up-to-date model of the environment. We prove theoretical results showing
conditions under which PA-MCTS selects the one-step optimal action and also
bound the error accrued while following PA-MCTS as a policy. We compare and
contrast our approach with AlphaZero, another hybrid planning approach, and
Deep Q Learning on several OpenAI Gym environments. Through extensive
experiments, we show that under non-stationary settings with limited time
constraints, PA-MCTS outperforms these baselines.
Comment: Extended Abstract accepted for presentation at AAMAS 202
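The PA-MCTS abstract says only that action-value estimates from an out-of-date policy are combined with estimates from an online search; it does not give the combination rule. As a rough illustration, the sketch below assumes a simple convex combination. The weight `alpha`, the function names, and the dictionary representation of action values are all hypothetical and not taken from the listing.

```python
def combine_action_values(q_policy, q_search, alpha):
    """Blend two action-value estimates per action.

    q_policy: values from the stale, pre-trained policy (assumed given).
    q_search: values from an online search on the current model (assumed given).
    alpha:    hypothetical trust weight on the stale policy, in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Assumed rule: convex combination of the two estimates.
    return {
        action: alpha * q_policy[action] + (1.0 - alpha) * q_search[action]
        for action in q_policy
    }


def select_action(q_policy, q_search, alpha):
    """Pick the action with the highest blended value."""
    blended = combine_action_values(q_policy, q_search, alpha)
    return max(blended, key=blended.get)


# Toy example: the stale policy prefers "left", the fresh search prefers "right".
stale = {"left": 1.0, "right": 0.0}
fresh = {"left": 0.0, "right": 1.0}
trusting_policy = select_action(stale, fresh, alpha=0.9)
trusting_search = select_action(stale, fresh, alpha=0.1)
```

Under this assumed rule, a large `alpha` trusts the stale policy and a small `alpha` trusts the search, which is one way to picture the trade-off the abstract describes when the environment has changed.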
Experience Filter: Using Past Experiences on Unseen Tasks or Environments
One of the bottlenecks of training autonomous vehicle (AV) agents is the
variability of training environments. Since learning optimal policies for
unseen environments is often very costly and requires substantial data
collection, it becomes computationally intractable to train the agent on every
possible environment or task the AV may encounter. This paper introduces a
zero-shot filtering approach to interpolate learned policies of past
experiences to generalize to unseen ones. We use an experience kernel to
correlate environments. These correlations are then exploited to produce
policies for new tasks or environments from learned policies. We demonstrate
our methods on an autonomous vehicle driving through T-intersections with
different characteristics, where its behavior is modeled as a partially
observable Markov decision process (POMDP). We first construct compact
representations of learned policies for POMDPs with unknown transition
functions given a dataset of sequential actions and observations. Then, we
filter parameterized policies of previously visited environments to generate
policies for new, unseen environments. We demonstrate our approaches on both an
actual AV and a high-fidelity simulator. Results indicate that our experience
filter offers a fast, low-effort, and near-optimal solution to create policies
for tasks or environments never seen before. Furthermore, the generated new
policies outperform the policy learned using the entire data collected from
past environments, suggesting that the correlation among different environments
can be exploited and irrelevant ones can be filtered out.
Comment: Accepted at IEEE Intelligent Vehicles Symposium (IV) 202
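The Experience Filter abstract describes correlating environments with an experience kernel and using those correlations to build a policy for an unseen environment from previously learned ones. The sketch below shows one way this could look, under assumptions that are not in the listing: environments are described by numeric feature vectors, policies by parameter vectors, the kernel is a radial basis function, and the new policy is a kernel-weighted average. All names are hypothetical.

```python
import math


def rbf_kernel(x, y, length_scale=1.0):
    """Assumed similarity between two environment feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def filter_policy(new_env, past_envs, past_params, length_scale=1.0):
    """Kernel-weighted average of past policy parameters (assumed rule)."""
    weights = [rbf_kernel(new_env, env, length_scale) for env in past_envs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no past environment is similar enough to weight")
    dim = len(past_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, past_params)) / total
        for i in range(dim)
    ]


# Toy example: two past environments, each with a two-parameter policy.
past_envs = [[0.0], [2.0]]
past_params = [[0.0, 10.0], [4.0, 20.0]]
midpoint_policy = filter_policy([1.0], past_envs, past_params)
near_first_policy = filter_policy([0.0], past_envs, past_params, length_scale=0.2)
```

In this toy setup, an environment halfway between the two past ones receives equal weights, while one that coincides with the first past environment (with a short length scale) essentially reuses that environment's parameters, so dissimilar experiences are down-weighted.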