ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
Understanding the continuous states of objects is essential for task learning
and planning in the real world. However, most existing task learning benchmarks
assume discrete (e.g., binary) object goal states, which poses challenges for
learning complex tasks and for transferring learned policies from simulated
environments to the real world. Furthermore, state discretization limits a
robot's ability to follow human instructions based on the grounding of actions
and states. To tackle these challenges, we present ARNOLD, a benchmark that
evaluates language-grounded task learning with continuous states in realistic
3D scenes. ARNOLD comprises 8 language-conditioned tasks that involve
understanding object states and learning policies for continuous goals. To
promote language-instructed learning, we provide expert demonstrations with
template-generated language descriptions. We assess task performance using the
latest language-conditioned policy learning models. Our results indicate that
current models for language-conditioned manipulation still struggle to
generalize to novel goal states, scenes, and objects. These findings highlight the need
to develop new algorithms that address this gap and underscore the potential
for further research in this area. See our project page at:
https://arnold-benchmark.github.io
Comment: The first two authors contributed equally; 20 pages; 17 figures;
project available: https://arnold-benchmark.github.io
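To illustrate the distinction the abstract draws between discrete and continuous goal states, here is a minimal, hypothetical sketch (not the ARNOLD API; all names and values are assumptions): a binary success check versus one that requires reaching a target value within a tolerance.

```python
# Illustrative sketch only (hypothetical names, not the ARNOLD benchmark code):
# contrasting a binary goal check with a continuous one, where success means
# reaching a target joint value within a tolerance.

def binary_goal_reached(drawer_is_open: bool, goal_open: bool) -> bool:
    # Discrete formulation: the drawer is either "open" or "closed".
    return drawer_is_open == goal_open

def continuous_goal_reached(joint_position: float,
                            goal_position: float,
                            tolerance: float = 0.05) -> bool:
    # Continuous formulation: e.g. "open the drawer halfway" becomes a
    # target joint value that must be matched within a tolerance.
    return abs(joint_position - goal_position) <= tolerance

if __name__ == "__main__":
    print(binary_goal_reached(True, True))        # True
    print(continuous_goal_reached(0.48, 0.50))    # True (within tolerance)
    print(continuous_goal_reached(0.30, 0.50))    # False
```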
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Out-of-distribution (OOD) testing is increasingly popular for evaluating a
machine learning system's ability to generalize beyond the biases of a training
set. OOD benchmarks are designed to present a different joint distribution of
data and labels between training and test time. VQA-CP has become the standard
OOD benchmark for visual question answering, but we discovered three troubling
practices in its current use. First, most published methods exploit explicit
knowledge of how the OOD splits were constructed, often by "inverting" the
distribution of labels, e.g., answering mostly 'yes' when the common training
answer is 'no'. Second, the OOD test set is used for model
selection. Third, a model's in-domain performance is assessed after retraining
it on in-domain splits (VQA v2) that exhibit a more balanced distribution of
labels. These three practices defeat the objective of evaluating
generalization, and put into question the value of methods specifically
designed for this dataset. We show that embarrassingly simple methods,
including one that generates answers at random, surpass the state of the art on
some question types. We provide short- and long-term solutions to avoid these
pitfalls and realize the benefits of OOD evaluation.
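As an illustration of the kind of "embarrassingly simple" baseline described above, here is a minimal, hypothetical sketch (not the paper's code; all names and data structures are assumptions) of a model that answers each question by sampling uniformly from the answers observed for its question type in training.

```python
# Minimal sketch under assumed data structures (hypothetical, not the paper's
# implementation): a trivial baseline that ignores the image and question text
# and samples an answer uniformly from those seen for the question type in
# training. Such baselines can score surprisingly well on an OOD split whose
# construction (e.g. inverted label distributions) is partly known.
import random
from collections import defaultdict

def build_answer_pools(train_examples):
    # train_examples: iterable of (question_type, answer) pairs.
    pools = defaultdict(set)
    for qtype, answer in train_examples:
        pools[qtype].add(answer)
    return {qtype: sorted(answers) for qtype, answers in pools.items()}

def random_answer(question_type, pools, rng=random):
    # Sample uniformly from the answers seen for this question type.
    return rng.choice(pools[question_type])

if __name__ == "__main__":
    train = [("yes/no", "yes"), ("yes/no", "no"),
             ("number", "2"), ("number", "3")]
    pools = build_answer_pools(train)
    print(random_answer("yes/no", pools))
```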