Deep compositional robotic planners that follow natural language commands
We demonstrate how a sampling-based robotic planner can be augmented to learn
to understand a sequence of natural language commands in a continuous
configuration space to move and manipulate objects. Our approach combines a
deep network structured according to the parse of a complex command that
includes objects, verbs, spatial relations, and attributes, with a
sampling-based planner, RRT. A recurrent hierarchical deep network controls how
the planner explores the environment, determines when a planned path is likely
to achieve a goal, and estimates the confidence of each move to trade off
exploitation and exploration between the network and the planner. Planners are
designed to have near-optimal behavior when information about the task is
missing, while networks learn to exploit observations which are available from
the environment, making the two naturally complementary. Combining the two
enables generalization to new maps, new kinds of obstacles, and more complex
sentences that do not occur in the training set. Little data is required to
train the model even though it jointly acquires a CNN that extracts features
from the environment while learning the meanings of words. Although the model
is end-to-end, it offers a degree of interpretability: its attention maps let
users inspect its reasoning steps. This end-to-end model allows robots to
learn to follow natural language commands in challenging continuous
environments.
Comment: Accepted in ICRA 202
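The confidence-gated hand-off between the network and the planner can be sketched with ordinary functions (a minimal stand-in, assuming a 2-D configuration space; `policy`, `select_sample`, and all names here are illustrative, not the paper's implementation):

```python
import random

def select_sample(policy, bounds, rng=random):
    """One sampling step of a policy-biased RRT.

    `policy` stands in for the learned recurrent network: it returns a
    suggested configuration and a confidence in [0, 1]. With probability
    equal to that confidence we exploit the network's suggestion;
    otherwise we fall back to uniform sampling, preserving the planner's
    near-optimal behavior when the network is unsure.
    """
    suggestion, confidence = policy()
    if rng.random() < confidence:
        return suggestion, "network"
    lo, hi = bounds
    return (rng.uniform(lo, hi), rng.uniform(lo, hi)), "planner"
```

A full planner would feed `select_sample` into RRT's usual nearest-neighbor extension loop; only the sampling step changes.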
Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas
We demonstrate a reinforcement learning agent which uses a compositional
recurrent neural network that takes as input an LTL formula and determines
satisfying actions. The input LTL formulas have never been seen before, yet the
network performs zero-shot generalization to satisfy them. This is a novel form
of multi-task learning for RL agents where agents learn from one diverse set of
tasks and generalize to a new set of diverse tasks. The formulation of the
network enables this capacity to generalize. We demonstrate this ability in two
domains. In a symbolic domain, the agent finds a sequence of letters that is
accepted. In a Minecraft-like environment, the agent finds a sequence of
actions that conform to the formula. While prior work could learn to execute
one formula reliably given examples of that formula, we demonstrate how to
encode all formulas reliably. This could form the basis of new multitask agents
that discover sub-tasks and execute them without any additional training, as
well as agents that follow more complex linguistic commands. The
structures required for this generalization are specific to LTL formulas, which
opens up an interesting theoretical question: what structures are required in
neural networks for zero-shot generalization to different logics?
Comment: Accepted in IROS 202
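The compositional idea, one module per LTL operator assembled according to the formula's structure, can be illustrated with ordinary functions in place of learned sub-networks (a sketch over finite traces; the tuple-based AST encoding is invented for illustration):

```python
def compile_formula(ast):
    """Recursively compile an LTL parse tree into an evaluator over finite
    traces (a trace is a list of sets of true propositions). Each operator
    becomes its own 'module', mirroring how a compositional network would
    assemble one sub-network per operator; here the modules are plain
    functions rather than learned components.
    """
    op = ast[0]
    if op == "prop":
        name = ast[1]
        return lambda trace, t: name in trace[t]
    if op == "not":
        sub = compile_formula(ast[1])
        return lambda trace, t: not sub(trace, t)
    if op == "and":
        left, right = compile_formula(ast[1]), compile_formula(ast[2])
        return lambda trace, t: left(trace, t) and right(trace, t)
    if op == "next":
        sub = compile_formula(ast[1])
        return lambda trace, t: t + 1 < len(trace) and sub(trace, t + 1)
    if op == "until":
        left, right = compile_formula(ast[1]), compile_formula(ast[2])
        def until(trace, t):
            # a U b: b must hold at some step k, with a holding until then
            for k in range(t, len(trace)):
                if right(trace, k):
                    return all(left(trace, j) for j in range(t, k))
            return False
        return until
    raise ValueError(f"unknown operator: {op}")
```

Because the evaluator is built from the formula's parse, it handles formulas never seen before, which is the structural property the paper exploits for zero-shot generalization.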
Compositional Networks Enable Systematic Generalization for Grounded Language Understanding
Humans are remarkably flexible when understanding new sentences that include
combinations of concepts they have never encountered before. Recent work has
shown that while deep networks can mimic some human language abilities when
presented with novel sentences, systematic variation uncovers the limitations
in the language-understanding abilities of neural networks. We demonstrate that
these limitations can be overcome by addressing the generalization challenges
in a recently-released dataset, gSCAN, which explicitly measures how well a
robotic agent is able to interpret novel ideas grounded in vision, e.g., novel
pairings of adjectives and nouns. The key principle we employ is
compositionality: that the compositional structure of networks should reflect
the compositional structure of the problem domain they address, while allowing
all other parameters and properties to be learned end-to-end with weak
supervision. We build a general-purpose mechanism that enables robots to
generalize their language understanding to compositional domains. Crucially,
our base network has the same state-of-the-art performance as prior work, 97%
execution accuracy, while at the same time generalizing its knowledge when
prior work does not; for example, achieving 95% accuracy on novel
adjective-noun compositions where previous work has 55% average accuracy.
Robust language understanding without dramatic failures and without corner
cases is critical to building safe and fair robots; we demonstrate the
significant role that compositionality can play in achieving that goal.
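As a toy illustration of the principle (not the paper's network), per-word modules can be composed to ground a novel adjective-noun pairing; every name and attribute below is invented:

```python
def make_filter(word):
    """One 'module' per word: keep only scene objects carrying that
    attribute. In a compositional network these modules are learned;
    here they are hand-written filters, just to show the wiring.
    """
    return lambda objects: [o for o in objects if word in o["attrs"]]

def ground_phrase(words, objects):
    """Ground a phrase by composing one module per word, so a novel
    pairing like 'yellow square' works as long as 'yellow' and 'square'
    were each acquired separately during training.
    """
    for word in words:
        objects = make_filter(word)(objects)
    return objects
```

The composition order mirrors the phrase's structure, which is exactly the kind of structural alignment between network and problem domain the abstract argues for.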
Learning a natural-language to LTL executable semantic parser for grounded robotics
Children acquire their native language with apparent ease by observing how
language is used in context and attempting to use it themselves. They do so
without laborious annotations, negative examples, or even direct corrections.
We take a step toward robots that can do the same by training a grounded
semantic parser, which discovers latent linguistic representations that can be
used for the execution of natural-language commands. In particular, we focus on
the difficult domain of commands with a temporal aspect, whose semantics we
capture with Linear Temporal Logic, LTL. Our parser is trained with pairs of
sentences and executions as well as an executor. At training time, the parser
hypothesizes a meaning representation for the input as a formula in LTL. Three
competing pressures allow the parser to discover meaning from language. First,
any hypothesized meaning for a sentence must be permissive enough to reflect
all the annotated execution trajectories. Second, the executor -- a pretrained
end-to-end LTL planner -- must find that the observed trajectories are likely
executions of the meaning. Finally, a generator, which reconstructs the
original input, encourages the model to find representations that conserve
knowledge about the command. Together these ensure that the meaning is neither
too general nor too specific. Our model generalizes well, being able to parse
and execute both machine-generated and human-generated commands, with
near-equal accuracy, despite the fact that the human-generated sentences are
much more varied and complex with an open lexicon. The approach presented here
is not specific to LTL: it can be applied to any domain where sentence meanings
can be hypothesized and an executor can verify these meanings, thus opening the
door to many applications for robotic agents.
Comment: 10 pages, 2 figures, Accepted in Conference on Robot Learning (CoRL) 202
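The three pressures can be sketched as a scoring rule over hypothesized meanings (the function name and the log-probability inputs are illustrative stand-ins for the executor and generator networks, not the paper's code):

```python
def hypothesis_score(satisfies, trajectories, executor_logprob, generator_logprob):
    """Score a hypothesized LTL meaning under the three pressures.

    Pressure 1 (permissiveness): the meaning must admit every annotated
    trajectory, otherwise the hypothesis is rejected outright.
    Pressures 2 and 3: the executor should find the trajectories likely
    under the meaning, and the generator should be able to reconstruct
    the original sentence from it; both enter as log-probabilities, so a
    meaning that is too general or too specific scores poorly.
    """
    if not all(satisfies(t) for t in trajectories):
        return float("-inf")
    return executor_logprob + generator_logprob
```

Training would then prefer the highest-scoring hypothesis for each sentence, balancing generality against specificity exactly as the abstract describes.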
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction
We study the task of object interaction anticipation in egocentric videos.
Successful prediction of future actions and objects requires an understanding
of the spatio-temporal context formed by past actions and object relationships.
We propose TransFusion, a multimodal transformer-based architecture, that
effectively makes use of the representational power of language by summarizing
past actions concisely. TransFusion leverages pre-trained image captioning
models and summarizes the caption, focusing on past actions and objects. This
action context together with a single input frame is processed by a multimodal
fusion module to forecast the next object interactions. Our model enables more
efficient end-to-end learning by replacing dense video features with language
representations, allowing us to benefit from knowledge encoded in large
pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 show the
effectiveness of our multimodal fusion model and the benefits of using
language-based context summaries. Our method outperforms state-of-the-art
approaches by 40.4% in overall mAP on the Ego4D test set. We show the
generality of TransFusion via experiments on EPIC-KITCHENS-100. Video and code
are available at: https://eth-ait.github.io/transfusion-proj/
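The caption-summarization step can be approximated with a toy keyword filter (a stand-in only: TransFusion uses pretrained captioning and summarization models, and the vocabulary here is invented):

```python
def summarize_captions(captions, action_object_vocab):
    """Compress past-frame captions into a short action context by
    keeping only action and object words, de-duplicated in order of
    first appearance. A compact language summary like this is what
    replaces dense video features at the fusion module's input.
    """
    kept = []
    for caption in captions:
        kept.extend(w for w in caption.lower().split() if w in action_object_vocab)
    return " ".join(dict.fromkeys(kept))
```

The resulting string, together with a single current frame, is all the context the fusion module needs, which is where the efficiency gain over dense video features comes from.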
Neural Amortized Inference for Nested Multi-agent Reasoning
Multi-agent interactions, such as communication, teaching, and bluffing,
often rely on higher-order social inference, i.e., understanding how others
infer oneself. Such intricate reasoning can be effectively modeled through
nested multi-agent reasoning. However, the computational complexity escalates
exponentially with each level of reasoning, posing a significant challenge;
humans, by contrast, perform such complex social inferences effortlessly as
part of their daily lives. To bridge the gap between human-like inference
capabilities and computational limitations, we propose a novel approach:
leveraging neural networks to amortize high-order social inference, thereby
expediting nested multi-agent reasoning. We evaluate our method in two
challenging multi-agent interaction domains. The experimental results
demonstrate that our method is computationally efficient while exhibiting
minimal degradation in accuracy.
Comment: 8 pages, 10 figure
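The amortization idea can be illustrated by contrasting exact nested reasoning with a precomputed lookup (a deliberately simplified sketch; real models branch over many candidate beliefs at each level, and the amortizing network is a learned function, not a table):

```python
CALLS = {"exact": 0}

def level_k_belief(level, observation):
    """Exact nested reasoning: a level-k agent models how a level-(k-1)
    agent would reason about the same observation, recursing down to a
    level-0 agent that reads the observation directly.
    """
    CALLS["exact"] += 1
    if level == 0:
        return ("sees", observation)
    return ("models", level_k_belief(level - 1, observation))

def amortize(exact_fn, training_inputs):
    """Stand-in for the amortizing network: precompute answers for the
    training distribution so later queries are a single cheap lookup,
    just as a learned network replaces nested inference with one forward
    pass. Unseen inputs fall back to exact inference.
    """
    table = {x: exact_fn(x) for x in training_inputs}
    return lambda x: table[x] if x in table else exact_fn(x)
```

After the one-time cost of building the table (training, in the paper's setting), queries on familiar inputs no longer pay for the nested recursion at all.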
Case report: VA-ECMO for fulminant myocarditis in an infant with acute COVID-19
Fulminant myocarditis in children was rare during the coronavirus disease 2019 pandemic, but it had the potential for high morbidity and mortality. We describe the clinical course of a previously healthy 9-month-old male infant who rapidly deteriorated into cardiogenic shock due to coronavirus disease 2019-related fulminant myocarditis. He developed severe heart failure and multiple organ dysfunction syndrome, which were treated promptly with central venoarterial extracorporeal membrane oxygenation and continuous venovenous hemofiltration. He made a good recovery without significant morbidity.