
    EgoTaskQA: Understanding Human Tasks in Egocentric Videos

    Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such a capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark, which provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?), to provide diagnostic analyses of spatial, temporal, and causal understandings of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gaps between them and humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning. Comment: Published at the NeurIPS 2022 Track on Datasets and Benchmarks.
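
    To make the question-answering evaluation concrete, here is a minimal sketch (not the official EgoTaskQA evaluation code) of computing exact-match accuracy separately for the four question types listed above; the field names "question_id", "question_type", and "answer" are assumptions for illustration.

    from collections import defaultdict

    def per_category_accuracy(predictions, annotations):
        # predictions: {question_id: predicted answer string}
        # annotations: iterable of dicts with question_id, question_type, answer
        correct, total = defaultdict(int), defaultdict(int)
        for ann in annotations:
            qtype = ann["question_type"]  # "descriptive", "predictive", "explanatory", "counterfactual"
            total[qtype] += 1
            pred = predictions.get(ann["question_id"], "").strip().lower()
            if pred == ann["answer"].strip().lower():
                correct[qtype] += 1
        return {t: correct[t] / total[t] for t in total}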

    Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

    Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework that allows arbitrary visual representations to be evaluated on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
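
    The idea of evaluating a fixed representation with a shared encoder-decoder setup can be illustrated with a generic frozen-backbone probe. The PyTorch sketch below is an assumption-laden stand-in (a backbone that returns flat feature vectors, an MLP decoder head), not the actual G-VUE framework.

    import torch
    import torch.nn as nn

    class RepresentationProbe(nn.Module):
        # Frozen visual backbone + small trainable task decoder, so the score
        # reflects the quality of the representation rather than end-to-end fine-tuning.
        def __init__(self, backbone: nn.Module, feat_dim: int, num_outputs: int):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False
            self.decoder = nn.Sequential(
                nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, num_outputs)
            )

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                feats = self.backbone(images)  # assumed shape: (batch, feat_dim)
            return self.decoder(feats)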

    X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events

    Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents' grasp of intuitive physics. Built on the Violation of Expectation (VoE) paradigm rooted in developmental psychology, X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models. Each VoE scenario within X-VoE encompasses three distinct settings, probing models' comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physics dynamics and infers occluded object states solely from visual sequences, without explicit occlusion labels. Experimental outcomes highlight our model's alignment with human commonsense when tested against X-VoE. A remarkable feature is our model's ability to visually explain VoE events by reconstructing concealed scenes. In closing, we discuss the findings' implications and outline future research directions. Through X-VoE, we catalyze the advancement of AI endowed with human-like intuitive physics capabilities. Comment: 19 pages, 16 figures; selected for an oral presentation at ICCV 2023. Project link: https://pku.ai/publication/intuitive2023iccv
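
    As a rough illustration of how VoE-style evaluations are commonly scored (a generic pairwise metric, assumed here for intuition rather than taken from the X-VoE protocol), a model's surprise signal can be compared across matched expected/surprising videos:

    import numpy as np

    def voe_pairwise_accuracy(surprise_expected, surprise_surprising):
        # Fraction of matched (expected, surprising) video pairs for which the
        # model assigns strictly higher surprise to the physically implausible one.
        expected = np.asarray(surprise_expected, dtype=float)
        surprising = np.asarray(surprise_surprising, dtype=float)
        return float(np.mean(surprising > expected))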

    ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

    Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for learning complex tasks and for transferring learned policies from simulated environments to the real world. Furthermore, state discretization limits a robot's ability to follow human instructions based on the grounding of actions and states. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD comprises 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulation continue to experience significant challenges in novel goal-state generalization, scene generalization, and object generalization. These findings highlight the need for new algorithms that address this gap and underscore the potential for further research in this area. See our project page at: https://arnold-benchmark.github.io Comment: The first two authors contributed equally; 20 pages; 17 figures; project available at: https://arnold-benchmark.github.io
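
    The contrast between discrete and continuous goal states can be made concrete with a toy success test; the tolerance value below is an illustrative assumption, not ARNOLD's official threshold.

    def continuous_goal_reached(achieved: float, goal: float, tolerance: float = 0.05) -> bool:
        # e.g. "open the drawer halfway": success means the achieved state value
        # lands within a tolerance band around the continuous goal value,
        # rather than matching a binary open/closed flag.
        return abs(achieved - goal) <= tolerance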

    Incorporating World Model Knowledge into Event Parsing, Prediction, and Reasoning

    Event understanding is one of the most fundamental problems in artificial intelligence and computer vision. Rooted in the field of neuroscience, the study and analysis of human motion perception have long suggested that we perceive human activities as goal-directed behaviors. As an essential capability of humans, we interpret others' goals and learn tasks through the endless video stream of daily activities. To endow machines with the same intelligent behaviors, the challenges of developing such a capability lie in the difficulty of generating a detailed understanding of world-model knowledge, including situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism in human multi-tasking and by partial observations originating from both egocentric perception and uncertainties in estimating others' beliefs in multi-agent collaborations. In this dissertation, we propose to study this missing gap from both the data and the modeling perspective by incorporating knowledge of the world model for proper event parsing, prediction, and reasoning. First, we propose three datasets, RAVEN, LEMMA, and EgoTaskQA, to study the event understanding problem in both abstract and real domains. We further devise three benchmarks to evaluate models' detailed understanding of events with (1) intelligence tests for spatial-temporal reasoning in RAVEN, (2) compositional action recognition and prediction in LEMMA, and (3) task-conditioned question answering in EgoTaskQA. Next, from the modeling side, we decompose the problem of event understanding into a unified framework that involves three essential modules: grounding, inference, and the knowledge base. To properly solve the problem of detailed event understanding, we need to focus on (1) the perception problem for grounding, (2) the knowledge representation problem, and (3) the inference problem. For the perception problem, we discuss the potential of existing models and propose BO-QSA for the unsupervised emergence of object-centric concepts. For the inference problem, we discuss ways to initialize the overall framework with (1) PrAE, which makes use of probabilistic abduction given logical rules, and (2) GEP, which leverages stochastic context-free grammars for modeling. We conduct experiments to show their effectiveness on various tasks and also discuss the limitations of each proposed work to highlight immediate next steps and possible future directions.

    Human Activity Understanding and Prediction with Stochastic Grammar

    Video understanding is a booming research problem in computer vision. Because spatial and temporal information are inherently entangled in video, video understanding has been challenging, mainly due to the difficulty of building a unified framework in which these two aspects can be modeled jointly. Among the tasks in video understanding, human activity understanding and prediction serve as a good starting point, where the spatial-temporal reasoning capability of learning modules can be tested. Most current approaches to human activity understanding and prediction use deep neural networks for spatial-temporal reasoning. However, this type of approach lacks the ability to reason beyond local frames and conduct long-term temporal reasoning. On the other hand, stochastic grammar models can model observed sequences on a symbolic level with all history information considered, but they perform poorly when handling noisy input sequences. Given these insights and the problems of current approaches, we propose the generalized Earley parser to bridge the gap between sequence inputs and symbolic grammars. By combining the advantages of these two types of methods, we show that the proposed model achieves better performance on both human activity recognition and future prediction.
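
    To give a flavor of grammar-constrained parsing over noisy classifier outputs, here is a brute-force sketch with a hypothetical two-sentence activity grammar and naively equal-length segments. It is an intuition aid only, not the generalized Earley parser itself, which searches the grammar efficiently using prefix probabilities of the classifier output.

    import numpy as np

    GRAMMAR_SENTENCES = [            # hypothetical language of an activity grammar
        ("reach", "grasp", "move"),
        ("reach", "grasp", "place"),
    ]

    def best_grammar_parse(frame_probs: np.ndarray, label_to_idx: dict):
        # frame_probs: (T, num_labels) frame-wise classifier probabilities.
        # Scores each grammar-consistent label sentence by the log-likelihood of
        # its (naively equal-length) segmentation and returns the best one.
        T = frame_probs.shape[0]
        best_sentence, best_score = None, -np.inf
        for sentence in GRAMMAR_SENTENCES:
            bounds = np.linspace(0, T, len(sentence) + 1).astype(int)
            score = sum(
                np.log(frame_probs[bounds[i]:bounds[i + 1], label_to_idx[lab]] + 1e-9).sum()
                for i, lab in enumerate(sentence)
            )
            if score > best_score:
                best_sentence, best_score = sentence, score
        return best_sentence, best_score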