Understanding human tasks through video observations is an essential
capability of intelligent agents. The challenges of this capability lie in
forming a detailed understanding of situated actions, their
effects on object states (i.e., state changes), and their causal dependencies.
These challenges are further aggravated by the natural parallelism from
multi-tasking and partial observations in multi-agent collaboration. Most prior
work uses action localization or future prediction as an indirect metric
for evaluating such task understanding from videos. To make a direct
evaluation, we introduce the EgoTaskQA benchmark that provides a single home
for the crucial dimensions of task understanding through question-answering on
real-world egocentric videos. We meticulously design questions that target the
understanding of (1) action dependencies and effects, (2) intents and goals,
and (3) agents' beliefs about others. These questions fall into four
types: descriptive (what status?), predictive (what will?),
explanatory (what caused?), and counterfactual (what if?), providing diagnostic
analyses of spatial, temporal, and causal understanding of goal-oriented
tasks. We evaluate state-of-the-art video reasoning models on our benchmark and
show a significant gap between their performance and that of humans in understanding complex
goal-oriented egocentric videos. We hope this effort will drive the vision
community to move onward with goal-oriented video understanding and reasoning.

Comment: Published at NeurIPS Track on Datasets and Benchmarks 202