Integrated Ray-Tracing and Coverage Planning Control using Reinforcement Learning
In this work we propose a coverage planning control approach which allows a
mobile agent, equipped with a controllable sensor (i.e., a camera) with limited
sensing domain (i.e., finite sensing range and angle of view), to cover the
surface area of an object of interest. The proposed approach integrates
ray-tracing into the coverage planning process, thus allowing the agent to
identify which parts of the scene are visible at any point in time. The problem
of integrated ray-tracing and coverage planning control is first formulated as
a constrained optimal control problem (OCP), which aims at determining the
agent's optimal control inputs over a finite planning horizon that minimize
the coverage time. Efficiently solving the resulting OCP is, however, very
challenging due to non-convex and non-linear visibility constraints. To
overcome this limitation, the problem is converted into a Markov decision
process (MDP) which is then solved using reinforcement learning. In particular,
we show that a controller which follows an optimal control law can be learned
using off-policy temporal-difference control (i.e., Q-learning). Extensive
numerical experiments demonstrate the effectiveness of the proposed approach
for various configurations of the agent and the object of interest.Comment: 2022 IEEE 61st Conference on Decision and Control (CDC), 06-09
December 2022, Cancun, Mexic
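As a concrete illustration of the learning step this abstract describes, the following is a minimal sketch of tabular off-policy temporal-difference control (Q-learning) on a generic discretized MDP. The environment interface (`reset`, `step`, `actions`), state encoding, and reward are hypothetical stand-ins, not the paper's actual formulation.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (off-policy TD control).
# The environment interface and reward are hypothetical; the paper's
# actual MDP encodes agent pose, camera state, and per-face coverage
# of the target object.

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # off-policy TD target uses the greedy (target) policy
            best_next = max(Q[(next_state, a)] for a in env.actions)
            td_target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```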
Biological learning and artificial intelligence
It was once taken for granted that learning in animals and man could be explained with a simple set of general learning rules, but over the last hundred years, a substantial amount of evidence has accumulated that points in a quite different direction. In animal learning theory, the laws of learning are no longer considered general. Instead, it has been necessary to explain behaviour in terms of a large set of interacting learning mechanisms and innate behaviours. Artificial intelligence is now on the edge of making the transition from general theories to a view of intelligence that is based on an amalgam of interacting systems. In the light of the evidence from animal learning theory, such a transition is highly desirable.
Graph-based Reinforcement Learning meets Mixed Integer Programs: An application to 3D robot assembly discovery
Robot assembly discovery (RAD) is a challenging problem that lives at the
intersection of resource allocation and motion planning. The goal is to combine
a predefined set of objects to form something new while considering task
execution with the robot-in-the-loop. In this work, we tackle the problem of
building arbitrary, predefined target structures entirely from scratch using a
set of Tetris-like building blocks and a robotic manipulator. Our novel
hierarchical approach aims at efficiently decomposing the overall task into
three feasible levels that benefit mutually from each other. On the high level,
we run a classical mixed-integer program for global optimization of block-type
selection and the blocks' final poses to recreate the desired shape. Its output
is then exploited to efficiently guide the exploration of an underlying
reinforcement learning (RL) policy. This RL policy draws its generalization
properties from a flexible graph-based representation that is learned through
Q-learning and can be refined with search. Moreover, it accounts for the
necessary conditions of structural stability and robotic feasibility that
cannot be effectively reflected in the previous layer. Lastly, a grasp and
motion planner transforms the desired assembly commands into robot joint
movements. We demonstrate our proposed method's performance on a set of
competitive simulated RAD environments, showcase real-world transfer, and
report performance and robustness gains compared to an unstructured end-to-end
approach. Videos are available at https://sites.google.com/view/rl-meets-milp
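To illustrate the flavor of the high-level optimization stage, here is a toy sketch of a block-selection MIP, assuming the open-source PuLP modeling library; the 1-D tiling instance, variable names, and objective are invented for illustration and are far simpler than the paper's actual 3D assembly program.

```python
import pulp  # assumes the open-source PuLP MILP modeling library

# Toy stand-in for the high-level block-selection MIP: tile a small
# 1-D target segment of length 6 using block types of length 2 and 3,
# so that every target cell is covered exactly once. The paper's real
# program optimizes block types and 3D poses; this only shows the shape.

TARGET = range(6)
# candidate placements: (block_length, start_cell)
PLACEMENTS = [(l, s) for l in (2, 3) for s in TARGET if s + l <= len(TARGET)]

prob = pulp.LpProblem("block_selection", pulp.LpMinimize)
x = {p: pulp.LpVariable(f"x_{p[0]}_{p[1]}", cat="Binary") for p in PLACEMENTS}

# minimize the number of blocks used
prob += pulp.lpSum(x.values())

# each target cell must be covered by exactly one placed block
for c in TARGET:
    prob += pulp.lpSum(x[(l, s)] for (l, s) in PLACEMENTS if s <= c < s + l) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [p for p in PLACEMENTS if x[p].value() == 1]
print(chosen)  # e.g. [(3, 0), (3, 3)]
```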
Belief Tree Search for Active Object Recognition
Active Object Recognition (AOR) has been approached as an unsupervised
learning problem, in which optimal trajectories for object inspection are not
known and are to be discovered by reducing label uncertainty measures or
training with reinforcement learning. Such approaches have no guarantees of the
quality of their solution. In this paper, we treat AOR as a Partially
Observable Markov Decision Process (POMDP) and find near-optimal policies on
training data using Belief Tree Search (BTS) on the corresponding belief Markov
Decision Process (MDP). AOR then reduces to the problem of knowledge transfer
from near-optimal policies on the training set to the test set. We train a Long
Short Term Memory (LSTM) network to predict the best next action on the
training set rollouts. We show that the proposed AOR method generalizes well to
novel views of familiar objects and also to novel objects. We compare this
supervised scheme against guided policy search, and find that the LSTM network
reaches higher recognition accuracy compared to the guided policy method. We
further look into optimizing the observation function to increase the total
collected reward of the optimal policy. In AOR, the observation function is known
only approximately. We propose a gradient-based method to update this
approximate observation function to increase the total reward of any policy. We
show that by optimizing the observation function and retraining the supervised
LSTM network, the AOR performance on the test set improves significantly.
Comment: IROS 201
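As background for the belief-MDP construction this abstract relies on, the following is a minimal sketch of the standard discrete Bayes-filter belief update; the tensors `T` and `O` and their dimensions are hypothetical placeholders, not the paper's actual AOR model.

```python
import numpy as np

# Minimal discrete Bayes-filter belief update for a belief MDP,
# the construction that BTS plans over. T, O, and the dimensions
# are hypothetical placeholders.

def belief_update(b, a, o, T, O):
    """b: belief over states, shape (S,)
    a: action index; o: observation index
    T: transitions, T[a, s, s'] = P(s' | s, a), shape (A, S, S)
    O: observations, O[a, s', o] = P(o | s', a), shape (A, S, Obs)
    """
    predicted = b @ T[a]              # P(s' | b, a), shape (S,)
    updated = O[a][:, o] * predicted  # unnormalized posterior over s'
    norm = updated.sum()
    if norm == 0.0:
        raise ValueError("observation has zero likelihood under the model")
    return updated / norm
```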
Experimental results: Reinforcement Learning of POMDPs using Spectral Methods
We propose a new reinforcement learning algorithm for partially observable
Markov decision processes (POMDP) based on spectral decomposition methods.
While spectral methods have been previously employed for consistent learning of
(passive) latent variable models such as hidden Markov models, POMDPs are more
challenging since the learner interacts with the environment and possibly
changes the future observations in the process. We devise a learning algorithm
running through epochs: in each epoch, we employ spectral techniques to learn
the POMDP parameters from a trajectory generated by a fixed policy. At the end
of the epoch, an optimization oracle returns the optimal memoryless planning
policy which maximizes the expected reward based on the estimated POMDP model.
We prove an order-optimal regret bound with respect to the optimal memoryless
policy and efficient scaling with respect to the dimensionality of observation
and action spaces.
Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain
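A structural sketch of the epoch-based procedure described above follows; `spectral_estimate` and `plan_memoryless` are hypothetical callables standing in for the paper's spectral-decomposition estimator and optimization oracle, and the environment interface is likewise assumed.

```python
# Structural sketch of the epoch-based algorithm described above.
# `spectral_estimate` and `plan_memoryless` are user-supplied callables
# standing in for the spectral-decomposition estimator and the
# optimization oracle; the env interface (reset/step) is also assumed.

def learn_pomdp(env, initial_policy, spectral_estimate, plan_memoryless,
                num_epochs=10, steps_per_epoch=1000):
    policy = initial_policy
    for _ in range(num_epochs):
        # 1) roll out a trajectory under the current fixed policy
        trajectory = []
        obs = env.reset()
        for _ in range(steps_per_epoch):
            action = policy(obs)
            next_obs, reward = env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs

        # 2) estimate the POMDP parameters from the trajectory
        #    (spectral decomposition of multi-view moments in the paper)
        model = spectral_estimate(trajectory)

        # 3) the oracle returns the memoryless policy maximizing
        #    expected reward under the estimated model
        policy = plan_memoryless(model)
    return policy
```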