LEAP: LLM-Generation of Egocentric Action Programs
We introduce LEAP (illustrated in Figure 1), a novel method for generating
video-grounded action programs through the use of a Large Language Model (LLM).
These action programs represent the motoric, perceptual, and structural aspects
of action, and consist of sub-actions, pre- and post-conditions, and control
flows. LEAP's action programs are centered on egocentric video and employ
recent developments in LLMs both as a source for program knowledge and as an
aggregator and assessor of multimodal video information. We apply LEAP over a
majority (87%) of the training set of the EPIC Kitchens dataset, and release
the resulting action programs as a publicly available dataset here
(https://drive.google.com/drive/folders/1Cpkw_TI1IIxXdzor0pOXG3rWJWuKU5Ex?usp=drive_link).
We employ LEAP as a secondary source of supervision, using its action programs
in a loss term applied to action recognition and anticipation networks. We
demonstrate sizable improvements in performance in both tasks due to training
with the LEAP dataset. Our method achieves 1st place on the EPIC Kitchens
Action Recognition leaderboard as of November 17 among the networks restricted
to RGB-input (see Supplementary Materials).
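As a rough illustration of the structure described above, the following Python sketch shows one possible container for a LEAP-style action program, with sub-actions, pre- and post-conditions, and a simple control-flow annotation. All field names and the example program are illustrative assumptions and do not reflect the schema of the released dataset.

from dataclasses import dataclass, field
from typing import List

# Illustrative container for a LEAP-style action program: sub-actions with
# pre-/post-conditions and simple control flow. Field names are assumptions,
# not the schema of the released dataset.

@dataclass
class SubAction:
    verb: str                                                 # e.g. "grasp"
    target: str                                               # e.g. "knife"
    preconditions: List[str] = field(default_factory=list)    # e.g. ["hand is empty"]
    postconditions: List[str] = field(default_factory=list)   # e.g. ["knife in hand"]

@dataclass
class ActionProgram:
    action_label: str                                         # high-level action, e.g. "cut onion"
    sub_actions: List[SubAction] = field(default_factory=list)
    control_flow: List[str] = field(default_factory=list)     # e.g. ["repeat step 2 until done"]

# Example instance for an EPIC-Kitchens-style clip.
program = ActionProgram(
    action_label="cut onion",
    sub_actions=[
        SubAction("grasp", "knife", ["hand is empty"], ["knife in hand"]),
        SubAction("cut", "onion", ["knife in hand", "onion on board"], ["onion sliced"]),
    ],
    control_flow=["repeat sub-action 2 until onion is fully sliced"],
)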
Feedback for Vision
Feedback plays a prominent role in biological vision, where perception is modulated based on agents' evolving expectations and world models. This is the case both in visually understanding the static structure of the world and in modeling the dynamic structure of action. In this thesis we present, first, an approach to incorporating controlled feedback into image understanding; second, an adaptation of this approach to action understanding; and lastly, a notion of feedback in video monitoring. First, we introduce a novel mechanism which modulates perception based on high-level categorical expectations: Mid-Vision Feedback (MVF). MVF associates high-level contexts with linear transformations. When a context is "expected", its associated linear transformation is applied to feature vectors at a mid level of a network. The result is that mid-level network representations are biased towards conformance with high-level expectations, improving overall accuracy and contextual consistency. Additionally, during training, mid-level feature vectors are biased through the introduction of a loss term which increases the distance between feature vectors associated with different contexts. MVF is agnostic as to the source of contextual expectations, and can serve as a mechanism for top-down integration of symbolic systems with deep vision architectures. We demonstrate the utility of MVF for object classification across three popular datasets and multiple architectures, including both Convolutional Neural Network architectures and a Transformer architecture.
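The following PyTorch-style sketch illustrates the MVF mechanism as described above: one learnable linear transformation per high-level context is applied to mid-level feature vectors when that context is expected, and a simple margin loss pushes features from different contexts apart. Module, function, and argument names are illustrative, and the loss shown is a stand-in for the training term described in the text, not the exact published formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MidVisionFeedback(nn.Module):
    """Sketch of MVF: one linear map per high-level context, applied to
    mid-level features when that context is expected."""

    def __init__(self, num_contexts: int, feat_dim: int):
        super().__init__()
        # One learnable linear transformation per context.
        self.context_transforms = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim) for _ in range(num_contexts)
        )

    def forward(self, mid_feats: torch.Tensor, context_id: int) -> torch.Tensor:
        # mid_feats: (batch, feat_dim) mid-level features from the backbone.
        return self.context_transforms[context_id](mid_feats)

def context_separation_loss(feats: torch.Tensor, context_ids: torch.Tensor,
                            margin: float = 1.0) -> torch.Tensor:
    """Push mid-level features of different contexts apart (hinge on pairwise
    distance); an illustrative stand-in for the loss term described above."""
    dists = torch.cdist(feats, feats)                                   # (batch, batch)
    diff_context = (context_ids[:, None] != context_ids[None, :]).float()
    return (diff_context * F.relu(margin - dists)).sum() / (diff_context.sum() + 1e-8)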
We adapt MVF for action understanding with Sub-Action Modulation (SAM) for Video Networks. When humans interpret action, they bring high-level expectations of the context in which those actions are being performed. Following this line of thinking, we develop an approach to incorporating context into action understanding. Video segments are classified uniquely into a small set of action primitives (called Therbligs), which are grouped hierarchically into "Meta-Therbligs" as a context representation. SAM first models Meta-Therbligs, and then incorporates expectations of Meta-Therbligs into mid-level processes through feedback. This allows the modulation of mid-level features in accordance with a temporally compositional representation of context. We show the superior performance of MVF over post-hoc filtering for incorporating contextual knowledge, and show superior performance of configurations using predicted context (when no context is known a priori) over configurations with no context awareness. We demonstrate the utility of SAM over four popular video understanding architectures: I3D, MoViNet, TimeSFormer, and ViViT. Experiments over EPIC Kitchens and 50 Salads on the tasks of action recognition and anticipation demonstrate that SAM produces superior accuracies across all models, tasks, and datasets with minimal architectural alterations.
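A minimal sketch of the hierarchical context representation described above: a clip's sequence of Therblig primitives is summarized into a Meta-Therblig prediction, which could then drive MVF-style mid-level feedback when no context is known a priori. The particular architecture shown (embedding plus GRU) is an assumption for illustration only, not the model used in SAM.

import torch
import torch.nn as nn

class MetaTherbligPredictor(nn.Module):
    """Sketch: predict a Meta-Therblig context from a clip's Therblig sequence,
    so the predicted context can be fed to a mid-level feedback module."""

    def __init__(self, num_therbligs: int, num_meta: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(num_therbligs, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_meta)

    def forward(self, therblig_ids: torch.Tensor) -> torch.Tensor:
        # therblig_ids: (batch, seq_len) integer Therblig labels per segment.
        x = self.embed(therblig_ids)
        _, h = self.rnn(x)                    # h: (1, batch, hidden)
        return self.head(h.squeeze(0))        # logits over Meta-Therblig contexts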
Lastly, we consider a notion of "feedback" where high-level expectations, or specifications, are provided by human operators, allowing integration of humans into the perceptual loop. This is important for interfacing with humans, as perceptual tasks which are conventionally left entirely to human labor are increasingly (yet still imperfectly) automated. We consider the task of surveillance. Security watchstanders who monitor multiple videos over long periods of time can be susceptible to information overload and fatigue. To address this, we present a configurable perception pipeline architecture, called the Image Surveillance Assistant (ISA), for assisting watchstanders with video surveillance tasks. We also present an initial implementation of ISA that can be configured with a set of context specifications, which watchstanders can select or provide to indicate what imagery should generate notifications. ISA's inputs include (1) an image and (2) context specifications, which contain English sentences and a decision boundary defined over object detection vectors. ISA assesses the match of the image with the contexts by comparing (1) detected versus specified objects and (2) automatically generated versus specified captions. Finally, we present a study to assess the utility of using captions in ISA, and find that they substantially improve the performance of image context detection.
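The following sketch illustrates the ISA matching step described above: an image's object detections and auto-generated caption are compared against a context specification's objects and sentence, and the combined score is tested against the specification's decision boundary. The data layout, the equal weighting, and the text_similarity placeholder are assumptions for illustration, not the implemented system.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ContextSpec:
    """Illustrative context specification: the objects and English sentence a
    watchstander cares about, plus a threshold acting as the decision boundary."""
    required_objects: List[str]
    sentence: str
    threshold: float

def matches_context(detections: Dict[str, float],       # object name -> detector confidence
                    generated_caption: str,
                    spec: ContextSpec,
                    text_similarity: Callable[[str, str], float]) -> bool:
    """Sketch of the matching step: combine object agreement and caption
    agreement into one score and compare against the spec's decision boundary."""
    # Fraction of specified objects that the detector reported with confidence > 0.5.
    obj_score = sum(detections.get(obj, 0.0) > 0.5 for obj in spec.required_objects)
    obj_score /= max(len(spec.required_objects), 1)

    # Similarity between the auto-generated caption and the specified sentence
    # (text_similarity is a placeholder, e.g. cosine similarity of sentence embeddings).
    cap_score = text_similarity(generated_caption, spec.sentence)

    return 0.5 * obj_score + 0.5 * cap_score >= spec.threshold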
Finally, the notions of context, and the contrast used to separate context for better manipulation in the feedback work above, can be of benefit not only to feedback architectures but to feed-forward architectures as well. We apply this intuition to the task of action understanding in video, where input is separated into motion and "context". Motivated by Goldman's Theory of Human Action, a framework in which action decomposes into (1) base physical movements and (2) the context in which they occur, we propose a novel learning formulation for motion and context, where context is derived as the complement to motion. More specifically, we model physical movement through the adoption of Therbligs, a set of elemental physical motions centered around object manipulation. Context is modeled through the use of a contrastive mutual information loss that formulates context information as the action information not contained within movement information. We empirically demonstrate the utility brought by this separation of representation, showing sizable improvements in action recognition and action anticipation accuracies for a variety of models. We present results over two object manipulation datasets: EPIC Kitchens 100 and 50 Salads.
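One way to read the contrastive mutual information loss described above is as minimizing an InfoNCE-style lower bound on the mutual information between motion and context features of the same clip, so that the context stream is pushed toward action information not already carried by movement. The sketch below is one plausible instantiation under that reading, not the exact published loss.

import math
import torch
import torch.nn.functional as F

def motion_context_mi_loss(motion_feats: torch.Tensor,
                           context_feats: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Illustrative InfoNCE-style lower bound on the mutual information between
    motion and context features of the same clips; minimizing it discourages the
    context stream from duplicating motion information. One plausible
    instantiation only, not the exact published formulation."""
    m = F.normalize(motion_feats, dim=-1)     # (batch, d)
    c = F.normalize(context_feats, dim=-1)    # (batch, d)
    logits = m @ c.t() / temperature          # pairwise similarities
    labels = torch.arange(m.size(0), device=m.device)
    nce = F.cross_entropy(logits, labels)     # contrastive loss with matched pairs as positives
    # InfoNCE bound: I(motion; context) >= log(N) - nce; minimize the bound.
    return math.log(m.size(0)) - nce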
An Image-To-Speech iPad App
We describe an iPad app which assists in language acquisition and development. Such an application can be used by clinicians working with human developmental disabilities. A user drags images around on the screen, and the app generates and speaks random (but sensible) phrases that match the image interaction. For example, if a user drags an image of a squirrel onto an image of a tree, the app may say "the squirrel ran up the tree." A key challenge is the automated creation of "sensible" English phrases, which we solve by using a large corpus and machine learning.
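A toy sketch of how "sensible" phrase selection could work from corpus statistics: candidate phrases built from the two image labels are scored with bigram counts gathered from a large text corpus, and the best-supported candidate is spoken. The templates, scoring, and function names are illustrative assumptions, not the app's actual model.

from collections import Counter
from itertools import product
from typing import Iterable, List

def build_bigram_counts(corpus_sentences: Iterable[str]) -> Counter:
    """Count word bigrams in a (large) text corpus; stands in for the
    corpus statistics mentioned above."""
    counts = Counter()
    for sent in corpus_sentences:
        words = sent.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def best_phrase(subject: str, obj: str, verbs: List[str], preps: List[str],
                bigrams: Counter) -> str:
    """Pick the candidate "the <subject> <verb> <prep> the <object>" phrase whose
    verb and preposition choices are best supported by the corpus bigram counts."""
    def score(verb: str, prep: str) -> int:
        return bigrams[(subject, verb)] + bigrams[(verb, prep)] + bigrams[(prep, obj)]
    verb, prep = max(product(verbs, preps), key=lambda vp: score(*vp))
    return f"the {subject} {verb} {prep} the {obj}"

# Example: with even a toy corpus, the scoring prefers "the squirrel ran up the tree".
corpus = ["the squirrel ran up the tree", "a dog ran to the park"]
bigrams = build_bigram_counts(corpus)
print(best_phrase("squirrel", "tree", ["ran", "sat"], ["up", "on"], bigrams))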