Visual7W: Grounded Question Answering in Images
We have seen great progress in basic perceptual tasks such as object
recognition and detection. However, AI models still fall short of humans on
high-level vision tasks because they lack the capacity for deeper reasoning.
Recently the new task of visual question answering (QA) has been proposed to
evaluate a model's capacity for deep image understanding. Previous works have
established a loose, global association between QA sentences and images.
However, many questions and answers, in practice, relate to local regions in
the images. We establish a semantic link between textual descriptions and image
regions by object-level grounding. This enables a new type of QA with visual
answers, in addition to textual answers used in previous work. We study the
visual QA tasks in a grounded setting with a large collection of 7W
multiple-choice QA pairs. Furthermore, we evaluate human performance and
several baseline models on the QA tasks. Finally, we propose a novel LSTM model
with spatial attention to tackle the 7W QA tasks.
Comment: CVPR 201
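
As a rough illustration of the LSTM-with-spatial-attention idea for multiple-choice
visual QA, here is a minimal PyTorch sketch. The region-feature shape, the additive
attention form, and the bilinear answer scoring are illustrative assumptions, not the
paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionQA(nn.Module):
    """Toy multiple-choice VQA model: LSTM question encoder + spatial attention."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(feat_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.answer_score = nn.Bilinear(hidden_dim, hidden_dim, 1)

    def forward(self, img_feats, question, answers):
        # img_feats: (B, R, feat_dim) region features from a conv grid
        # question:  (B, Tq) token ids; answers: (B, 4, Ta) token ids
        q_out, _ = self.lstm(self.embed(question))
        q_state = q_out[:, -1]                                  # (B, H) question summary
        img = self.img_proj(img_feats)                          # (B, R, H)
        att = F.softmax(self.att_score(torch.tanh(img + q_state.unsqueeze(1))), dim=1)
        ctx = (att * img).sum(dim=1)                            # attended visual context

        scores = []
        for k in range(answers.size(1)):                        # score each candidate
            a_out, _ = self.lstm(self.embed(answers[:, k]))
            scores.append(self.answer_score(ctx + q_state, a_out[:, -1]))
        return torch.cat(scores, dim=1)                         # (B, 4) choice logits
```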
Action Recognition by Hierarchical Mid-level Action Elements
Realistic videos of human actions exhibit rich spatiotemporal structures at
multiple levels of granularity: an action can always be decomposed into
multiple finer-grained elements in both space and time. To capture this
intuition, we propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an action-related spatiotemporal
segment in the video. We introduce an unsupervised method to generate this
representation from videos. Our method is capable of distinguishing
action-related segments from background segments and representing actions at
multiple spatiotemporal resolutions. Given a set of spatiotemporal segments
generated from the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple levels of granularity.
We develop structured models that capture a rich set of spatial, temporal and
hierarchical relations among the segments, where the action label and multiple
levels of MAE labels are jointly inferred. The proposed model achieves
state-of-the-art performance on multiple action recognition benchmarks.
Moreover, we demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed videos and
action parsing.
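
The discriminative clustering step can be pictured as alternating between cluster
assignments and a discriminative model that re-labels segments, as in this toy
scikit-learn sketch. The descriptor dimensionality, cluster count, and SVM-based
reassignment rule are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def discover_maes(segments, n_clusters=10, n_iters=5):
    """Alternate cluster assignments with a discriminative re-labeling step."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(segments)
    clf = None
    for _ in range(n_iters):
        clf = LinearSVC(C=1.0, max_iter=5000).fit(segments, labels)
        new_labels = clf.predict(segments)          # re-assign segments discriminatively
        if np.array_equal(new_labels, labels):
            break                                   # assignments are stable
        labels = new_labels
    return labels, clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descriptors = rng.normal(size=(200, 64))        # stand-in segment descriptors
    mae_labels, _ = discover_maes(descriptors)
    print(np.bincount(mae_labels))                  # cluster sizes at convergence
```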
Scene Graph Generation by Iterative Message Passing
Understanding a visual scene goes beyond recognizing individual objects in
isolation. Relationships between objects also constitute rich semantic
information about the scene. In this work, we explicitly model the objects and
their relationships using scene graphs, a visually-grounded graphical structure
of an image. We propose a novel end-to-end model that generates such a
structured scene representation from an input image. The model solves the
scene graph inference problem using standard RNNs and learns to iteratively
improve its
predictions via message passing. Our joint inference model can take advantage
of contextual cues to make better predictions on objects and their
relationships. The experiments show that our model significantly outperforms
previous methods for generating scene graphs on the Visual Genome dataset and
inferring support relations on the NYU Depth v2 dataset.
Comment: CVPR 201
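
A compact way to picture the iterative message-passing idea is a pair of recurrent
units that repeatedly exchange information between object (node) states and
relationship (edge) states. The sketch below is a hedged approximation; the
sum-pooling of messages, feature dimensions, and class counts are illustrative
assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class IterativeMessagePassing(nn.Module):
    """Toy node/edge GRU updates that refine object and relationship predictions."""

    def __init__(self, dim=256, n_obj_classes=150, n_rel_classes=50, steps=3):
        super().__init__()
        self.steps = steps
        self.node_gru = nn.GRUCell(dim, dim)
        self.edge_gru = nn.GRUCell(dim, dim)
        self.obj_head = nn.Linear(dim, n_obj_classes)
        self.rel_head = nn.Linear(dim, n_rel_classes)

    def forward(self, node_feats, edge_feats, edges):
        # node_feats: (N, dim) per-object features; edge_feats: (E, dim)
        # edges: (E, 2) long tensor of (subject index, object index) pairs
        h_node, h_edge = node_feats, edge_feats
        subj, obj = edges[:, 0], edges[:, 1]
        for _ in range(self.steps):
            # each edge receives the current states of its two endpoints
            h_edge = self.edge_gru(h_node[subj] + h_node[obj], h_edge)
            # each node aggregates messages from all incident edges
            node_msg = torch.zeros_like(h_node)
            node_msg.index_add_(0, subj, h_edge)
            node_msg.index_add_(0, obj, h_edge)
            h_node = self.node_gru(node_msg, h_node)
        return self.obj_head(h_node), self.rel_head(h_edge)
```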
Learning Generalizable Manipulation Policies with Object-Centric 3D Representations
We introduce GROOT, an imitation learning method for learning robust policies
with object-centric and 3D priors. GROOT builds policies that generalize beyond
their initial training conditions for vision-based manipulation. It constructs
object-centric 3D representations that are robust to background changes and
camera views, and reasons over these representations using a transformer-based
policy. Furthermore, we introduce a segmentation correspondence model that
allows policies to generalize to new objects at test time. Through
comprehensive experiments, we validate the robustness of GROOT policies against
perceptual variations in simulated and real-world environments. GROOT's
performance excels in generalization over background changes, camera viewpoint
shifts, and the presence of new object instances, whereas both state-of-the-art
end-to-end learning methods and object proposal-based approaches fall short. We
also extensively evaluate GROOT policies on real robots, where we demonstrate
its efficacy under drastic changes in setup. More videos and model details can
be found in the appendix and on the project website:
https://ut-austin-rpl.github.io/GROOT
Comment: Accepted at the 7th Annual Conference on Robot Learning (CoRL), 2023
in Atlanta, U
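
To make the object-centric 3D idea concrete, the sketch below encodes each object's
point cloud into a token and feeds the token set, plus a proprioception token, through
a transformer that predicts an action. The encoder sizes, token layout, and action head
are assumptions for illustration and do not reproduce GROOT's design.

```python
import torch
import torch.nn as nn

class ObjectCentric3DPolicy(nn.Module):
    """Encode per-object point clouds into tokens; a transformer predicts the action."""

    def __init__(self, dim=128, proprio_dim=8, action_dim=7, n_heads=4, n_layers=2):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))
        self.proprio_proj = nn.Linear(proprio_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, object_points, proprio):
        # object_points: (B, K, P, 3) segmented per-object point clouds
        # proprio: (B, proprio_dim) robot state
        obj_tokens = self.point_mlp(object_points).max(dim=2).values   # (B, K, dim)
        tokens = torch.cat([obj_tokens, self.proprio_proj(proprio).unsqueeze(1)], dim=1)
        encoded = self.encoder(tokens)
        return self.action_head(encoded[:, -1])      # read the action off the last token

policy = ObjectCentric3DPolicy()
action = policy(torch.randn(2, 3, 256, 3), torch.randn(2, 8))
print(action.shape)  # torch.Size([2, 7])
```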
LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery
We introduce LOTUS, a continual imitation learning algorithm that empowers a
physical robot to continuously and efficiently learn to solve new manipulation
tasks throughout its lifespan. The core idea behind LOTUS is constructing an
ever-growing skill library from a sequence of new tasks with a small number of
human demonstrations. LOTUS starts with a continual skill discovery process
using an open-vocabulary vision model, which extracts skills as recurring
patterns present in unsegmented demonstrations. Continual skill discovery
updates existing skills to avoid catastrophic forgetting of previous tasks and
adds new skills to solve novel tasks. LOTUS trains a meta-controller that
flexibly composes various skills to tackle vision-based manipulation tasks in
the lifelong learning process. Our comprehensive experiments show that LOTUS
outperforms state-of-the-art baselines by over 11% in success rate, showing its
superior knowledge transfer ability compared to prior methods. More results and
videos can be found on the project website:
https://ut-austin-rpl.github.io/Lotus/
Comment: ICRA 202
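
The ever-growing skill library can be sketched as incremental clustering of
demonstration segments: each new segment either refines an existing skill prototype or
spawns a new skill, and a simple meta-controller picks a skill by embedding similarity.
The similarity threshold and prototype update below are illustrative assumptions, not
LOTUS's actual discovery procedure.

```python
import numpy as np

class SkillLibrary:
    """Toy ever-growing skill library with a nearest-prototype meta-controller."""

    def __init__(self, merge_threshold=0.8):
        self.prototypes = []            # one embedding per skill
        self.segments = []              # segments assigned to each skill
        self.merge_threshold = merge_threshold

    def _similarity(self, a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def add_demonstration(self, segment_embeddings):
        """Assign each new segment to an existing skill or create a new one."""
        for emb in segment_embeddings:
            if self.prototypes:
                sims = [self._similarity(emb, p) for p in self.prototypes]
                best = int(np.argmax(sims))
                if sims[best] >= self.merge_threshold:
                    self.segments[best].append(emb)   # refine an existing skill
                    self.prototypes[best] = np.mean(self.segments[best], axis=0)
                    continue
            self.prototypes.append(emb)               # discover a new skill
            self.segments.append([emb])

    def select_skill(self, task_embedding):
        """Meta-controller stub: pick the most similar skill prototype."""
        sims = [self._similarity(task_embedding, p) for p in self.prototypes]
        return int(np.argmax(sims))

library = SkillLibrary()
library.add_demonstration(np.random.default_rng(0).normal(size=(5, 32)))
print(len(library.prototypes), library.select_skill(np.ones(32)))
```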
AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents
We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses
sequence models to tackle the challenges of generalization, long-term memory,
and meta-learning. Recent works have shown that off-policy learning can make
in-context RL with recurrent policies viable. Nonetheless, these approaches
require extensive tuning and limit scalability by creating key bottlenecks in
agents' memory capacity, planning horizon, and model size. AMAGO revisits and
redesigns the off-policy in-context approach to successfully train
long-sequence Transformers over entire rollouts in parallel with end-to-end RL.
Our agent is scalable and applicable to a wide range of problems, and we
demonstrate its strong performance empirically in meta-RL and long-term memory
domains. AMAGO's focus on sparse rewards and off-policy data also allows
in-context learning to extend to goal-conditioned problems with challenging
exploration. When combined with a multi-goal hindsight relabeling scheme, AMAGO
can solve a previously difficult category of open-world domains, where agents
complete many possible instructions in procedurally generated environments.
Comment: ICLR 202
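
A bare-bones version of an in-context policy is a causal transformer that reads an
entire trajectory of (observation, previous action, reward) tokens and emits an action
distribution at every step, as sketched below. Dimensions and tokenization are
assumptions, and the sketch omits the off-policy training loop and hindsight relabeling
described above.

```python
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    """Causal transformer over (obs, prev action, reward) tokens for a whole rollout."""

    def __init__(self, obs_dim, act_dim, dim=128, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.token_proj = nn.Linear(obs_dim + act_dim + 1, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(dim, act_dim)

    def forward(self, obs, prev_act, rew):
        # obs: (B, T, obs_dim), prev_act: (B, T, act_dim), rew: (B, T, 1)
        T = obs.size(1)
        tokens = self.token_proj(torch.cat([obs, prev_act, rew], dim=-1))
        tokens = tokens + self.pos_emb(torch.arange(T, device=obs.device))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=obs.device), diagonal=1)
        hidden = self.encoder(tokens, mask=causal)    # no peeking at future timesteps
        return self.action_head(hidden)               # (B, T, act_dim) per-step logits

policy = InContextPolicy(obs_dim=16, act_dim=4)
logits = policy(torch.randn(2, 32, 16), torch.zeros(2, 32, 4), torch.zeros(2, 32, 1))
print(logits.shape)  # torch.Size([2, 32, 4])
```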
Doduo: Learning Dense Visual Correspondence from Unsupervised Semantic-Aware Flow
Dense visual correspondence plays a vital role in robotic perception. This
work focuses on establishing the dense correspondence between a pair of images
that capture dynamic scenes undergoing substantial transformations. We
introduce Doduo to learn general dense visual correspondence from in-the-wild
images and videos without ground truth supervision. Given a pair of images, it
estimates the dense flow field encoding the displacement of each pixel in one
image to its corresponding pixel in the other image. Doduo uses flow-based
warping to acquire supervisory signals for training. By incorporating semantic
priors into self-supervised flow training, Doduo produces accurate dense
correspondence that is robust to dynamic changes in the scene. Trained on an
in-the-wild video dataset, Doduo demonstrates superior performance on
point-level correspondence estimation over existing self-supervised
correspondence learning baselines. We also apply Doduo to articulation
estimation and zero-shot goal-conditioned manipulation, underlining its
practical applications in robotics. Code and additional visualizations are
available at https://ut-austin-rpl.github.io/Doduo
Comment: Project website: https://ut-austin-rpl.github.io/Doduo
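
The flow-based warping supervision mentioned above can be sketched as: predict a dense
flow from image A to image B, warp B back toward A with that flow, and penalize the
photometric difference. The grid normalization and L1 loss below are standard choices
used for illustration, not Doduo's exact objectives.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img_b, flow):
    # img_b: (B, C, H, W); flow: (B, 2, H, W) pixel displacements from A to B
    B, _, H, W = img_b.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(img_b.device)      # (2, H, W)
    coords = base.unsqueeze(0) + flow                                  # target pixels in B
    # normalize coordinates to [-1, 1] as expected by grid_sample
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(img_b, grid, align_corners=True)

def photometric_loss(img_a, img_b, flow):
    """Warp B toward A with the predicted flow and compare photometrically."""
    warped_b = warp_with_flow(img_b, flow)
    return (img_a - warped_b).abs().mean()

loss = photometric_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                        torch.zeros(1, 2, 64, 64))
print(float(loss))
```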
MUTEX: Learning Unified Policies from Multimodal Task Specifications
Humans use different modalities, such as speech, text, images, videos, etc.,
to communicate their intent and goals with teammates. For robots to become
better assistants, we aim to endow them with the ability to follow instructions
and understand tasks specified by their human partners. Most robotic policy
learning methods have focused on a single modality of task specification while
ignoring the rich cross-modal information. We present MUTEX, a unified
approach to policy learning from multimodal task specifications. It trains a
transformer-based architecture to facilitate cross-modal reasoning, combining
masked modeling and cross-modal matching objectives in a two-stage training
procedure. After training, MUTEX can follow a task specification in any of the
six learned modalities (video demonstrations, goal images, text goal
descriptions, text instructions, speech goal descriptions, and speech
instructions) or a combination of them. We systematically evaluate the benefits
of MUTEX in a newly designed dataset with 100 tasks in simulation and 50 tasks
in the real world, annotated with multiple instances of task specifications in
different modalities, and observe improved performance over methods trained
specifically for any single modality. More information is available at
https://ut-austin-rpl.github.io/MUTEX/
Comment: Accepted at the 7th Conference on Robot Learning (CoRL 2023), Atlanta,
US
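
To illustrate how masked modeling and cross-modal matching can be combined over tokens
from two task-specification modalities (say, text and video), here is a hedged sketch.
The encoders, masking rate, and cosine matching loss are assumptions and do not mirror
MUTEX's two-stage training procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSpecEncoder(nn.Module):
    """Fuse two token streams and train with masked modeling + cross-modal matching."""

    def __init__(self, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.recon_head = nn.Linear(dim, dim)

    def forward(self, text_tokens, video_tokens, mask_ratio=0.3):
        # text_tokens: (B, Tt, dim); video_tokens: (B, Tv, dim)
        tokens = torch.cat([text_tokens, video_tokens], dim=1)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        masked = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        fused = self.encoder(masked)
        # masked-modeling loss: reconstruct the original tokens at masked positions
        recon_loss = F.mse_loss(self.recon_head(fused)[mask], tokens[mask])
        # cross-modal matching loss: pooled text and video embeddings should align
        t_emb = F.normalize(fused[:, :text_tokens.size(1)].mean(dim=1), dim=-1)
        v_emb = F.normalize(fused[:, text_tokens.size(1):].mean(dim=1), dim=-1)
        match_loss = 1.0 - (t_emb * v_emb).sum(dim=-1).mean()
        return recon_loss + match_loss

model = CrossModalSpecEncoder()
loss = model(torch.randn(2, 8, 128), torch.randn(2, 16, 128))
print(float(loss))
```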