
    Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

    This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions. Given an instruction such as "Move the bottle on the left side of the plate to the empty chair," the DSR is expected to identify the bottle and the chair from multiple candidates in the environment and carry the target object to the destination. Most existing multimodal language understanding methods are computationally impractical because they require inference over all combinations of target object candidates and destination candidates. We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model. Our method is validated on a newly built dataset consisting of object manipulation instructions and semi-photo-realistic images captured in a standard Embodied AI simulator. The results show that our method outperforms the baseline method in terms of language comprehension accuracy. Furthermore, we conduct physical experiments in which a DSR delivers standardized everyday objects in a standardized domestic environment as requested by instructions with referring expressions. The experimental results show that the object grasping and placing actions are achieved with success rates of more than 90%.
    Comment: Accepted for presentation at IROS 2023
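    The core idea, scoring target and destination candidates separately with one shared model rather than enumerating every target/destination pair, can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed shapes and names: SwitchingHeadModel, the two heads, and the tiny encoder are hypothetical stand-ins, not the authors' implementation.

    import torch
    import torch.nn as nn

    class SwitchingHeadModel(nn.Module):
        """Stand-in for a UNITER-style cross-modal encoder with switchable heads."""
        def __init__(self, dim=768):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.target_head = nn.Linear(dim, 1)       # scores target-object candidates
            self.destination_head = nn.Linear(dim, 1)  # scores destination candidates

        def forward(self, text_emb, region_emb, predict_target):
            # One fusion pass over instruction tokens and candidate-region features.
            fused = self.encoder(torch.cat([text_emb, region_emb], dim=1))
            regions = fused[:, text_emb.size(1):]  # keep region positions only
            head = self.target_head if predict_target else self.destination_head
            return head(regions).squeeze(-1)       # one score per candidate

    model = SwitchingHeadModel()
    text = torch.randn(1, 16, 768)     # encoded instruction tokens
    regions = torch.randn(1, 10, 768)  # features of 10 candidate regions
    target_scores = model(text, regions, predict_target=True)
    destination_scores = model(text, regions, predict_target=False)

    Because each candidate is scored once per head, inference cost grows linearly with the number of candidates instead of quadratically over target/destination pairs.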

    CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation

    Navigation guided by natural language instructions is particularly suitable for domestic service robots that interact naturally with users. This task involves predicting a sequence of actions that leads to a specified destination given a natural language navigation instruction, and thus requires understanding instructions such as "Walk out of the bathroom and wait on the stairs that are on the right." Vision-and-Language Navigation remains challenging, notably because it requires exploring the environment and accurately following the path specified by the instruction, which in turn requires modeling the relationship between language and vision. To address this, we propose the CrossMap Transformer network, which encodes linguistic and visual features to sequentially generate a path. The CrossMap Transformer is tied to a Transformer-based speaker that generates navigation instructions. The two networks share common latent features for mutual enhancement through a double back-translation model: generated paths are translated into instructions, while generated instructions are translated into paths. The experimental results show the benefits of our approach in terms of instruction understanding and instruction generation.
    Comment: 8 pages, 5 figures, 5 tables. Submitted to IEEE Robotics and Automation Letters
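    The double back-translation scheme can be made concrete with a short training-step sketch. In the hedged Python sketch below, follower (instruction to path), speaker (path to instruction), path_loss, and text_loss are hypothetical placeholders, not the paper's actual interfaces.

    def double_back_translation_step(instruction, gold_path, follower, speaker,
                                     path_loss, text_loss):
        # Supervised directions: instruction -> path and path -> instruction.
        pred_path = follower(instruction)
        pred_instr = speaker(gold_path)
        loss = path_loss(pred_path, gold_path) + text_loss(pred_instr, instruction)
        # Back-translation: each network's generation supervises the other.
        loss = loss + text_loss(speaker(pred_path), instruction)  # path round trip
        loss = loss + path_loss(follower(pred_instr), gold_path)  # instruction round trip
        return loss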

    Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data

    The growing interest in language-conditioned robot manipulation aims to develop robots capable of understanding and executing complex tasks, with the objective of enabling robots to interpret language commands and manipulate objects accordingly. While language-conditioned approaches demonstrate impressive capabilities in familiar environments, they struggle to adapt to unfamiliar environment settings. In this study, we propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data to enhance generalization to unfamiliar environments. We assess our model's performance in both simulated and real-world environments in a zero-shot setting. In the simulated environment, the proposed approach surpasses previously reported scores on the CALVIN benchmark, especially in the challenging Zero-Shot Multi-Environment setting. The average completed task length, indicating the average number of tasks the agent can complete consecutively, improves by more than 2.5 times compared to the state-of-the-art method HULC. In addition, we conduct a zero-shot evaluation of our policy in a real-world setting, after training exclusively in simulated environments and without additional adaptation. In this evaluation, we set up ten tasks and achieved an average 30% improvement over the current state-of-the-art approach, demonstrating high generalization capability in both simulated environments and the real world. For further details, including access to our code and videos, please refer to our supplementary materials.
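    As a rough illustration of how a base skill prior can shape a language-conditioned policy, the sketch below regularizes the policy's latent skill distribution toward a language-free prior with a KL term, in the spirit of skill-prior methods. All module names, dimensions, and the Gaussian parameterization are assumptions for illustration, not the paper's implementation.

    import torch
    import torch.nn as nn

    class SkillPriorPolicy(nn.Module):
        def __init__(self, obs_dim=64, lang_dim=384, skill_dim=16, act_dim=7):
            super().__init__()
            self.skill_posterior = nn.Linear(obs_dim + lang_dim, 2 * skill_dim)
            self.skill_prior = nn.Linear(obs_dim, 2 * skill_dim)  # base skills, language-free
            self.decoder = nn.Linear(obs_dim + skill_dim, act_dim)

        def forward(self, obs, lang):
            # Language-conditioned skill posterior (reparameterized diagonal Gaussian).
            mu_q, logvar_q = self.skill_posterior(torch.cat([obs, lang], -1)).chunk(2, -1)
            z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
            action = self.decoder(torch.cat([obs, z], -1))
            # KL(posterior || base prior) pulls skills toward those in unstructured play data.
            mu_p, logvar_p = self.skill_prior(obs).chunk(2, -1)
            kl = 0.5 * (logvar_p - logvar_q - 1
                        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                        ).sum(-1).mean()
            return action, kl

    policy = SkillPriorPolicy()
    action, kl = policy(torch.randn(8, 64), torch.randn(8, 384))
    # Training would combine an imitation loss on the action with a weighted KL term.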

    Video Understanding: A Predictive Analytics Perspective

    This dissertation presents a detailed study of video predictive understanding, an emerging perspective in video-based computer vision research. This direction explores machine vision techniques that fill in missing spatiotemporal information in videos (e.g., predicting the future), which is of great importance for understanding real-world dynamics and benefits many applications. We investigate this direction in both depth and breadth. Four emerging areas are considered and advanced by our efforts: early action recognition, future activity prediction, trajectory prediction, and procedure planning. For each, our research presents innovative solutions based on machine learning techniques (deep learning in particular) while paying special attention to their interpretability, multi-modality, and efficiency, which we consider critical for next-generation Artificial Intelligence (AI). Finally, we conclude the dissertation by discussing current shortcomings as well as future directions.