Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks
This paper describes a domestic service robot (DSR) that fetches everyday
objects and carries them to specified destinations according to free-form
natural language instructions. Given an instruction such as "Move the bottle on
the left side of the plate to the empty chair," the DSR is expected to identify
the bottle and the chair from multiple candidates in the environment and carry
the target object to the destination. Most of the existing multimodal language
understanding methods are impractical in terms of computational complexity
because they require inferences for all combinations of target object
candidates and destination candidates. We propose Switching Head-Tail Funnel
UNITER, which solves the task by predicting the target object and the
destination individually using a single model. Our method is validated on a
newly-built dataset consisting of object manipulation instructions and semi
photo-realistic images captured in a standard Embodied AI simulator. The
results show that our method outperforms the baseline method in terms of
language comprehension accuracy. Furthermore, we conduct physical experiments
in which a DSR delivers standardized everyday objects in a standardized
domestic environment as requested by instructions with referring expressions.
The experimental results show that the object grasping and placing actions are
achieved with success rates of more than 90%.
Comment: Accepted for presentation at IROS202
CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation
Navigation guided by natural language instructions is particularly suitable
for Domestic Service Robots that interact naturally with users. This task
involves the prediction of a sequence of actions that leads to a specified
destination given a natural language navigation instruction. The task thus
requires the understanding of instructions, such as ``Walk out of the bathroom
and wait on the stairs that are on the right''. Vision-and-Language
Navigation remains challenging, notably because it requires exploring
the environment and accurately following the path specified by the
instructions, which in turn requires modeling the relationship between
language and vision. To address
this, we propose the CrossMap Transformer network, which encodes the linguistic
and visual features to sequentially generate a path. The CrossMap Transformer
is tied to a Transformer-based speaker that generates navigation instructions.
The two networks share common latent features, for mutual enhancement through a
double back-translation model: generated paths are translated into instructions,
while generated instructions are translated into paths. The experimental results
show the benefits of our approach in terms of instruction understanding and
instruction generation.
Comment: 8 pages, 5 figures, 5 tables. Submitted to IEEE Robotics and
Automation Letter
Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data
Language-conditioned robot manipulation, a field of growing interest, aims to
develop robots capable of understanding and executing complex tasks by
interpreting language commands and manipulating
objects accordingly. While language-conditioned approaches demonstrate
impressive capabilities for addressing tasks in familiar environments, they
encounter limitations in adapting to unfamiliar environment settings. In this
study, we propose a general-purpose, language-conditioned approach that
combines base skill priors and imitation learning under unstructured data to
enhance the algorithm's generalization in adapting to unfamiliar environments.
We assess our model's performance in both simulated and real-world environments
using a zero-shot setting. In the simulated environment, the proposed approach
surpasses previously reported scores on the CALVIN benchmark, especially in the
challenging Zero-Shot Multi-Environment setting. The average completed task
length, indicating the average number of tasks the agent can continuously
complete, improves by a factor of more than 2.5 compared to the state-of-the-art method
HULC. In addition, we conduct a zero-shot evaluation of our policy in a
real-world setting, following training exclusively in simulated environments
without additional specific adaptations. In this evaluation, we set up ten
tasks, on which our approach achieved an average improvement of 30% over the
current state-of-the-art approach, demonstrating high generalization
capability in both simulated environments and the real world. For further
details, including access to our code and videos, please refer to our
supplementary materials.
Video Understanding: A Predictive Analytics Perspective
This dissertation presents a detailed study of video predictive understanding, an emerging perspective in video-based computer vision research. This direction explores machine vision techniques that fill in missing spatiotemporal information in videos (e.g., predicting the future), which is of great importance for understanding real-world dynamics and benefits many applications. We investigate this direction in both depth and breadth. Four emerging areas are considered and advanced by our efforts: early action recognition, future activity prediction, trajectory prediction, and procedure planning. For each, our research presents innovative solutions based on machine learning techniques (deep learning in particular) while paying special attention to their interpretability, multi-modality, and efficiency, which we consider critical for next-generation Artificial Intelligence (AI). Finally, we conclude this dissertation by discussing current shortcomings as well as future directions.