
Gated-Attention Architectures for Task-Oriented Language Grounding

Authors: Chaplot Devendra Singh; Pasumarthi Rama Kumar; Rajagopal Dheeraj; Salakhutdinov Ruslan; Sathyendra Kanthashree Mysore
Publication date: 08/01/2018

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map them to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.

Comment: To appear in AAAI-1
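The abstract names a Gated-Attention mechanism for combining image and text representations but does not spell out its form. The following is a minimal sketch under stated assumptions, not the authors' implementation: the instruction embedding is projected to one sigmoid gate per image feature channel, and each feature map is scaled elementwise by its gate. The function name `gated_attention`, the parameters `weights` and `bias`, and the nested-list tensor layout are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(image_features, instruction_embedding, weights, bias):
    """Scale each image feature map by a gate computed from the instruction.

    Assumed shapes (hypothetical, for illustration only):
      image_features:        C feature maps, each an H x W nested list
      instruction_embedding: list of D floats
      weights:               C x D nested list (linear projection)
      bias:                  list of C floats
    Returns the gated feature maps and the per-channel gates.
    """
    # One gate per channel: sigmoid of a linear projection of the instruction.
    gates = [
        sigmoid(sum(w * e for w, e in zip(row, instruction_embedding)) + b)
        for row, b in zip(weights, bias)
    ]
    # Broadcast each gate over the spatial positions of its feature map.
    gated = [
        [[g * v for v in row] for row in fmap]
        for g, fmap in zip(gates, image_features)
    ]
    return gated, gates
```

In this sketch a gate near 1 passes a channel through almost unchanged and a gate near 0 suppresses it, so the instruction decides which visual features reach the policy.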

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

VIMA: General Robot Manipulation with Multimodal Prompts

Authors: Anandkumar Anima; Chen Yanjun; Dou Yongqiang; Fan Linxi; Fei-Fei Li; Gupta Agrim; Jiang Yunfan; Wang Guanzhi; Zhang Zichen; Zhu Yuke
Publication date: 28/05/2023

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/

Comment: ICML 2023 Camera-ready version. Project website: https://vimalabs.github.io

arXiv.org e-Print Archive