A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
Understanding the steps required to perform a task is an important skill for
AI systems. Learning these steps from instructional videos involves two
subproblems: (i) identifying the temporal boundary of sequentially occurring
segments and (ii) summarizing these steps in natural language. We refer to this
task as Procedure Segmentation and Summarization (PSS). In this paper, we take
a closer look at PSS and propose three fundamental improvements over current
methods. The segmentation task is critical, as generating a correct summary
requires each step of the procedure to be correctly identified. However,
current segmentation metrics often overestimate the segmentation quality
because they do not consider the temporal order of segments. In our first
contribution, we propose a new segmentation metric that takes into account the
order of segments, giving a more reliable measure of the accuracy of a given
predicted segmentation. Current PSS methods are typically trained by proposing
segments, matching them with the ground truth and computing a loss. However,
much like segmentation metrics, existing matching algorithms do not consider
the temporal order of the mapping between candidate segments and the ground
truth. In our second contribution, we propose a matching algorithm that
constrains the temporal order of segment mapping, and is also differentiable.
Lastly, we introduce multi-modal feature training for PSS, which further
improves segmentation. We evaluate our approach on two instructional video
datasets (YouCook2 and Tasty) and observe an improvement over the
state-of-the-art of and for procedure segmentation and
summarization, respectively.Comment: Accepted at BMVC 202
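To make the order-sensitivity idea concrete, here is a minimal illustrative sketch (not the paper's actual metric): predicted and ground-truth segments are aligned with a dynamic program that only permits order-preserving matches, so predictions that place steps out of sequence score lower than an order-agnostic mean-IoU matching would suggest.

```python
# Hypothetical order-aware segmentation score (illustration only, not the proposed metric).
# Segments are (start, end) tuples sorted by start time; the DP finds the monotonic
# one-to-one alignment that maximises total IoU, so out-of-order predictions are penalised.

def iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ordered_segmentation_score(pred, gt):
    n, m = len(pred), len(gt)
    # dp[i][j] = best total IoU using the first i predictions and first j ground-truth segments
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j],                                  # skip prediction i
                           dp[i][j - 1],                                  # leave gt segment j unmatched
                           dp[i - 1][j - 1] + iou(pred[i - 1], gt[j - 1]))  # match in order
    return dp[n][m] / m  # average over ground-truth segments

# Example: segments predicted in the wrong temporal order cannot all be matched,
# so the score drops relative to an order-agnostic matching.
gt = [(0, 10), (10, 25), (25, 40)]
pred = [(1, 9), (11, 24), (26, 39)]
print(ordered_segmentation_score(pred, gt))
```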
Long-Term Anticipation of Activities with Cycle Consistency
With the success of deep learning methods in analyzing activities in videos,
attention has recently turned towards anticipating future activities. However,
most work on anticipation either analyzes a partially observed activity or
predicts the next action class. Recently, approaches have been proposed that
extend the prediction horizon up to several minutes into the future and
anticipate a sequence of future activities, including their durations. While
these works decouple the semantic
interpretation of the observed sequence from the anticipation task, we propose
a framework for anticipating future activities directly from the features of
the observed frames and train it in an end-to-end fashion. Furthermore, we
introduce a cycle consistency loss over time by predicting the past activities
given the predicted future. Our framework achieves state-of-the-art results on
two datasets: the Breakfast dataset and 50Salads.
Comment: GCPR 202
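As a rough illustration of the cycle-consistency idea, the following sketch (an assumed GRU-based architecture, not the authors' exact model) anticipates future activities from observed frame features and then predicts the past back from the anticipated future, so a reconstruction loss on the observed past can be added to the anticipation loss.

```python
# Minimal sketch (assumed architecture): anticipate the future end-to-end from observed
# frame features, then cycle back by predicting the past from the anticipated future.
import torch
import torch.nn as nn

class CycleAnticipator(nn.Module):
    def __init__(self, feat_dim, num_classes, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)       # summarise observed frames
        self.future_head = nn.Linear(hidden, num_classes)                # anticipate future activity classes
        self.past_decoder = nn.GRU(num_classes, hidden, batch_first=True)
        self.past_head = nn.Linear(hidden, num_classes)                   # reconstruct the observed past

    def forward(self, obs_feats, future_len):
        # obs_feats: (batch, T_obs, feat_dim)
        _, h = self.encoder(obs_feats)                                     # h: (1, batch, hidden)
        # Simplified: repeat one prediction per anticipated step.
        future_logits = self.future_head(h[-1]).unsqueeze(1).repeat(1, future_len, 1)
        _, h_past = self.past_decoder(future_logits.softmax(-1))           # cycle back over the future
        past_logits = self.past_head(h_past[-1])
        return future_logits, past_logits

# Training would combine an anticipation loss on future_logits with a
# cycle-consistency loss that compares past_logits against the observed past activity.
```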
My View is the Best View: Procedure Learning from Egocentric Videos
Procedure learning involves identifying the key-steps and determining their
logical order to perform a task. Existing approaches commonly use third-person
videos for learning the procedure, making the manipulated object small in
appearance and often occluded by the actor, leading to significant errors. In
contrast, we observe that videos obtained from first-person (egocentric)
wearable cameras provide an unobstructed and clear view of the action. However,
procedure learning from egocentric videos is challenging because (a) the camera
view undergoes extreme changes due to the wearer's head motion, and (b) the
videos contain unrelated frames owing to their unconstrained nature. As a
result, the assumption made by current state-of-the-art methods that the actions
occur at approximately the same time and are of the same duration does not hold.
Instead, we propose to use the signal provided by the temporal correspondences
between key-steps across videos. To this end, we present a novel
self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC
identifies and utilizes the temporal correspondences between the key-steps
across multiple videos to learn the procedure. Our experiments show that CnC
outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets
by 5.2% and 6.3%, respectively. Furthermore, for procedure learning using
egocentric videos, we propose the EgoProceL dataset consisting of 62 hours of
videos captured by 130 subjects performing 16 tasks. The source code and the
dataset are available on the project page: https://sid2697.github.io/egoprocel/.
Comment: 25 pages, 6 figures, Accepted at the European Conference on Computer Vision (ECCV) 202
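The following sketch shows one way temporal correspondences can serve as a self-supervised signal, in the spirit of cycle-consistent soft nearest neighbours across two videos of the same task; the actual CnC framework differs in its details.

```python
# Illustrative cycle-back correspondence loss (not the CnC implementation): each frame
# of video A is softly matched to video B and cycled back; embeddings in which
# key-steps align across videos return each frame close to where it started.
import torch
import torch.nn.functional as F

def cycle_back_loss(emb_a, emb_b, temperature=0.1):
    """emb_a: (Ta, D), emb_b: (Tb, D) frame embeddings from two videos of the same task."""
    sim_ab = emb_a @ emb_b.t() / temperature            # (Ta, Tb) similarities A -> B
    soft_nn = F.softmax(sim_ab, dim=1) @ emb_b          # soft nearest neighbours in B, (Ta, D)
    sim_ba = soft_nn @ emb_a.t() / temperature           # cycle back to A, (Ta, Ta)
    log_probs = F.log_softmax(sim_ba, dim=1)
    target = torch.arange(emb_a.size(0))                 # each frame should map back to itself
    return F.nll_loss(log_probs, target)
```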
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing
what commonly happens after his/her current action (e.g. crack eggs)? What if
we also know the longer-term goal of the actor (e.g. making egg fried rice)?
The long-term action anticipation (LTA) task aims to predict an actor's future
behavior from video observations in the form of verb and noun sequences, and it
is crucial for human-machine interaction. We propose to formulate the LTA task
from two perspectives: a bottom-up approach that predicts the next actions
autoregressively by modeling temporal dynamics; and a top-down approach that
infers the goal of the actor and plans the needed procedure to accomplish the
goal. We hypothesize that large language models (LLMs), which have been
pretrained on procedure text data (e.g. recipes, how-tos), have the potential
to help LTA from both perspectives: they can provide prior knowledge of the
possible next actions and infer the goal given the observed part of a
procedure, respectively. To leverage LLMs, we propose a two-stage
framework, AntGPT. It first recognizes the actions already performed in the
observed videos and then asks an LLM to predict the future actions via
conditioned generation, or to infer the goal and plan the whole procedure by
chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2
benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the
effectiveness of our proposed approach. AntGPT achieves state-of-the-art
performance on all of the above benchmarks and, as qualitative analysis shows,
can successfully infer the goal and thus perform goal-conditioned
"counterfactual" prediction. Code and models will be released at
https://brown-palm.github.io/AntGP
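A hypothetical sketch of the two-stage idea is given below; `recognise_actions` and `llm_complete` are placeholders for an off-the-shelf action recogniser and an LLM call, not the released AntGPT code.

```python
# Hypothetical two-stage sketch: recognise the actions already performed, then
# condition an LLM on them to infer the goal and anticipate future (verb, noun) steps.

def recognise_actions(video_clips):
    """Stage 1 (assumed): an action recogniser over the observed clips."""
    return [("crack", "egg"), ("whisk", "egg")]  # e.g. recognised verb-noun pairs

def anticipate(video_clips, llm_complete, num_future=5):
    observed = recognise_actions(video_clips)
    prompt = (
        "Observed actions so far: "
        + ", ".join(f"{v} {n}" for v, n in observed)
        + f".\nFirst state the likely goal, then list the next {num_future} actions "
          "as 'verb noun', one per line."
    )
    # Stage 2: the LLM infers the goal and plans the remaining procedure.
    return llm_complete(prompt)
```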
Inner Monologue: Embodied Reasoning through Planning with Language Models
Recent works have shown how the reasoning capabilities of Large Language
Models (LLMs) can be applied to domains beyond natural language processing,
such as planning and interaction for robots. These embodied problems require an
agent to understand many semantic aspects of the world: the repertoire of
skills available, how these skills influence the world, and how changes to the
world map back to the language. LLMs planning in embodied environments need to
consider not just which skills to perform, but also how and when to perform
them, answers that change over time in response to the agent's own choices. In this work, we
investigate to what extent LLMs used in such embodied contexts can reason over
sources of feedback provided through natural language, without any additional
training. We propose that by leveraging environment feedback, LLMs are able to
form an inner monologue that allows them to more richly process and plan in
robotic control scenarios. We investigate a variety of sources of feedback,
such as success detection, scene description, and human interaction. We find
that closed-loop language feedback significantly improves high-level
instruction completion in three domains, including simulated and real tabletop
rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen
environment in the real world.
Comment: Project website: https://innermonologue.github.i
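The closed-loop idea can be sketched roughly as follows, with `execute_skill`, `detect_success`, and `describe_scene` as hypothetical placeholders rather than the actual Inner Monologue interfaces: each step's feedback is appended to the prompt so the LLM can replan in context.

```python
# Hypothetical closed-loop planning sketch with language feedback: after each executed
# skill, success detection and a scene description are fed back into the prompt.

def run_episode(instruction, llm_complete, execute_skill, detect_success, describe_scene,
                max_steps=10):
    monologue = [f"Human: {instruction}"]
    for _ in range(max_steps):
        prompt = "\n".join(monologue) + "\nRobot action:"
        skill = llm_complete(prompt).strip()
        if skill.lower() == "done":
            break
        execute_skill(skill)
        # Environment feedback becomes part of the "inner monologue" for the next step.
        monologue.append(f"Robot action: {skill}")
        monologue.append(f"Success: {detect_success(skill)}")
        monologue.append(f"Scene: {describe_scene()}")
    return monologue
```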
Towards event analysis in time-series data: Asynchronous probabilistic models and learning from partial labels
In this thesis, we contribute in two main directions: modeling asynchronous time-series data and learning from partially labelled data. We first propose novel probabilistic frameworks that improve the flexibility and expressiveness of current approaches to modeling complex real-world asynchronous event sequences. Second, we present a scalable approach to learning a deep multi-label classifier end-to-end from partial labels. To evaluate the effectiveness of the proposed frameworks, we focus on visual recognition applications; however, the frameworks are generic and can be applied to general settings of learning event sequences and learning multi-label classifiers from partial labels. Visual recognition is a fundamental piece of machine intelligence and has a wide range of applications, such as human activity analysis, autonomous driving, surveillance and security, and health-care monitoring. Through a wide range of experiments, we show that our proposed approaches help to build more powerful and effective visual recognition frameworks.
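As one concrete (assumed) instance of learning a multi-label classifier from partial labels, the loss below computes binary cross-entropy only on the labels that are observed for each example, so missing labels contribute no gradient; the thesis' exact formulation may differ.

```python
# Assumed partial-label loss sketch: mask out unobserved labels when computing BCE.
import torch
import torch.nn.functional as F

def partial_bce_loss(logits, targets, observed_mask):
    """logits, targets, observed_mask: (batch, num_labels); mask is 1 where the label is known."""
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (loss * observed_mask).sum() / observed_mask.sum().clamp(min=1)
```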