315 research outputs found
Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization
Weakly supervised temporal action localization (WSTAL) aims to localize
actions in untrimmed videos using video-level labels. Despite recent advances,
existing approaches mainly follow a localization-by-classification pipeline,
generally processing each segment individually, thereby exploiting only limited
contextual information. As a result, the model will lack a comprehensive
understanding (e.g. appearance and temporal structure) of various action
patterns, leading to ambiguity in classification learning and temporal
localization. Our work addresses this from a novel perspective, by exploring
and exploiting the cross-video contextual knowledge within the dataset to
recover the dataset-level semantic structure of action instances via weak
labels only, thereby indirectly improving the holistic understanding of
fine-grained action patterns and alleviating the aforementioned ambiguities.
Specifically, an end-to-end framework is proposed, including a Robust
Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge
Summarization and Aggregation (GKSA) module. First, the RMGCL module explores
the contrast and consistency of cross-video action features, assisting in
learning a more structured and compact embedding space, thus reducing ambiguity
in classification learning. Further, the GKSA module is used to efficiently
summarize and propagate the cross-video representative action knowledge in a
learnable manner to promote holistic understanding of action patterns, which in
turn allows the generation of high-confidence pseudo-labels for self-learning,
thus alleviating ambiguity in temporal localization. Extensive experiments on
THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method
outperforms the state-of-the-art methods, and can be easily plugged into other
WSTAL methods.
Comment: Submitted to TCSVT. 14 pages and 7 figures
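As a rough illustration of the contrastive idea mentioned in the abstract above (pulling a segment feature toward a same-class feature from another video and pushing it away from other-class features), here is a minimal sketch of a generic InfoNCE-style loss. It is not the RMGCL module itself: the abstract does not specify the loss, and the function names, the cosine similarity, and the `temperature` value are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in, not the paper's loss).

    query:     feature of one video segment
    positive:  feature of the same action class, e.g. from a memory bank
    negatives: features of other action classes
    """
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(math.exp(cosine(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-D features: the loss is small when the positive is
# aligned with the query and large when it is not.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss over many cross-video pairs is one common way to obtain a more compact, class-structured embedding space.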
Spatiotemporal Event Graphs for Dynamic Scene Understanding
Dynamic scene understanding is the ability of a computer system to interpret and make sense of the visual information present in a video of a real-world scene. In this thesis, we present a series of frameworks for dynamic scene understanding, starting from road event detection from an autonomous driving perspective to complex video activity detection, followed by continual learning approaches for the life-long learning of the models. Firstly, we introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle’s ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs and the corresponding scene locations. Due to the lack of datasets equipped with formally specified logical requirements, we also introduce the ROad event Awareness Dataset with logical Requirements (ROAD-R), the first publicly available dataset for autonomous driving with requirements expressed as logical constraints, as a tool for driving neurosymbolic research in the area.
Next, we extend event detection to holistic scene understanding by proposing two complex activity detection methods. In the first method, we present a deformable, spatiotemporal scene graph approach, consisting of three main building blocks: action tube detection, a 3D deformable RoI pooling layer designed for learning the flexible, deformable geometry of the constituent action tubes, and a scene graph constructed by considering all parts as nodes and connecting them based on different semantics. In a second approach evolving from the first, we propose a hybrid graph neural network that combines attention applied to a graph encoding of the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. Our contribution is threefold: i) a feature extraction technique; ii) a method for constructing a local scene graph followed by graph attention, and iii) a graph for temporally connecting all the local dynamic scene graphs.
Finally, the last part of the thesis presents a new continual semi-supervised learning (CSSL) paradigm, which we bring to the attention of the machine learning community. We also propose to formulate the continual semi-supervised learning problem as a latent-variable
LOOKING INTO ACTORS, OBJECTS AND THEIR INTERACTIONS FOR VIDEO UNDERSTANDING
Automatic video understanding is critical for enabling new applications in video surveillance, augmented reality, and beyond. Powered by deep networks that learn holistic representations of video clips, and large-scale annotated datasets, modern systems are capable of accurately recognizing hundreds of human activity classes. However, their performance significantly degrades as the number of actors in the scene or the complexity of the activities increases. Therefore, most of the research thus far has focused on videos that are short and/or contain a few activities performed only by adults. Furthermore, most current systems require expensive, spatio-temporal annotations for training. These limitations prevent the deployment of such systems in real-life applications, such as detecting activities of people and vehicles in extended surveillance videos.
To address these limitations, this thesis focuses on developing data-driven, compositional, region-based video understanding models motivated by the observation that actors, objects and their spatio-temporal interactions are the building blocks of activities and the main content of video descriptions provided by humans. This thesis makes three main contributions. First, we propose a novel Graph Neural Network for representation learning on heterogeneous graphs that encode spatio-temporal interactions between actor and object regions in videos. This model can learn context-aware representations for detected actors and objects, which we leverage for detecting complex activities. Second, we propose an attention-based deep conditional generative model of sentences, whose latent variables correspond to alignments between words in textual descriptions of videos and object regions. Building upon the framework of Conditional Variational Autoencoders, we train this model using only textual descriptions without bounding box annotations, and leverage its latent variables for localizing the actors and objects that are mentioned in generated or ground-truth descriptions of videos. Finally, we propose an actor-centric framework for real-time activity detection in videos that are extended both in space and time. Our framework leverages object detections and tracking to generate actor-centric tubelets, capturing all relevant spatio-temporal context for a single actor, and detects activities per tubelet based on contextual region embeddings. The models described have demonstrably improved the ability to temporally detect activities, as well as ground words in visual inputs
How transferable are video representations based on synthetic data?
Army Research Office; CCF-2007350 - National Science Foundation; CCF-1955981 - National Science Foundation
https://openreview.net/pdf?id=lRUCfzs5Hz
A Video-based End-to-end Pipeline for Non-nutritive Sucking Action Recognition and Segmentation in Young Infants
We present an end-to-end computer vision pipeline to detect non-nutritive
sucking (NNS) -- an infant sucking pattern with no nutrition delivered -- as a
potential biomarker for developmental delays, using off-the-shelf baby monitor
video footage. One barrier to clinical (or algorithmic) assessment of NNS stems
from its sparsity, requiring experts to wade through hours of footage to find
minutes of relevant activity. Our NNS activity segmentation algorithm solves
this problem by identifying periods of NNS with high certainty -- up to 94.0%
average precision and 84.9% average recall across 30 heterogeneous 60 s clips,
drawn from our manually annotated NNS clinical in-crib dataset of 183 hours of
overnight baby monitor footage from 19 infants. Our method is based on an
underlying NNS action recognition algorithm, which uses spatiotemporal deep
learning networks and infant-specific pose estimation, achieving 94.9%
accuracy in binary classification of 960 2.5 s balanced NNS vs. non-NNS clips.
Tested on our second, independent, and public NNS in-the-wild dataset, NNS
recognition classification reaches 92.3% accuracy, and NNS segmentation
achieves 90.8% precision and 84.2% recall.
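For readers unfamiliar with the segmentation metrics quoted above, the following minimal sketch shows how precision and recall can be computed from binary per-frame labels. The abstract does not state whether the reported figures are frame-level or segment-level, so the frame-level formulation, the function name, and the sample labels are all assumptions.

```python
def precision_recall(predicted, actual):
    """Frame-level precision and recall for binary activity labels.

    predicted, actual: equal-length sequences of 0/1 flags, where 1 marks
    a frame labeled as containing the activity (here, hypothetically, NNS).
    """
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # true positives
    fp = sum(1 for p, a in pairs if p and not a)    # false positives
    fn = sum(1 for p, a in pairs if not p and a)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for five frames: 2 true positives, 1 false positive,
# 1 false negative, so precision and recall are both 2/3.
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

High precision matters most for the use case described in the abstract, since it limits how much irrelevant footage an expert must review.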
Efficient Online Processing with Deep Neural Networks
The capabilities and adoption of deep neural networks (DNNs) grow at an
exhilarating pace: Vision models accurately classify human actions in videos
and identify cancerous tissue in medical scans as precisely as human experts;
large language models answer wide-ranging questions, generate code, and write
prose, becoming the topic of everyday dinner-table conversations. Even though
their uses are exhilarating, the continually increasing model sizes and
computational complexities have a dark side. The economic cost and negative
environmental externalities of training and serving models are in evident
disharmony with financial viability and climate action goals.
Instead of pursuing yet another increase in predictive performance, this
dissertation is dedicated to the improvement of neural network efficiency.
Specifically, a core contribution addresses the efficiency aspects during
online inference. Here, the concept of Continual Inference Networks (CINs) is
proposed and explored across four publications. CINs extend prior
state-of-the-art methods developed for offline processing of spatio-temporal
data and reuse their pre-trained weights, improving their online processing
efficiency by an order of magnitude. These advances are attained through a
bottom-up computational reorganization and judicious architectural
modifications. The benefit to online inference is demonstrated by reformulating
several widely used network architectures into CINs, including 3D CNNs,
ST-GCNs, and Transformer Encoders. An orthogonal contribution tackles the
concurrent adaptation and computational acceleration of a large source model
into multiple lightweight derived models. Drawing on fusible adapter networks
and structured pruning, Structured Pruning Adapters achieve superior predictive
accuracy under aggressive pruning using significantly fewer learned weights
compared to fine-tuning with pruning.
Comment: PhD Dissertation
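The general idea behind step-wise online inference with reused weights can be illustrated with a toy temporal convolution. The sketch below is not taken from the dissertation: the class and function names are hypothetical, and it shows only that caching the most recent inputs lets a layer emit one output per incoming time step while matching the offline, whole-sequence result for the same kernel.

```python
from collections import deque

def offline_conv1d(signal, kernel):
    """Offline processing: needs the whole sequence before computing."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

class ContinualConv1d:
    """Hypothetical step-wise layer that reuses the same kernel weights."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Keep only the last len(kernel) inputs; older ones are dropped.
        self.buffer = deque(maxlen=len(self.kernel))

    def step(self, x):
        """Consume one time step; return an output once the buffer is full."""
        self.buffer.append(x)
        if len(self.buffer) < len(self.kernel):
            return None
        return sum(w * v for w, v in zip(self.kernel, self.buffer))

signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
layer = ContinualConv1d(kernel)
online = [y for y in (layer.step(x) for x in signal) if y is not None]
offline = offline_conv1d(signal, kernel)  # same values as `online`
```

Each call to `step` costs one kernel-sized dot product, whereas re-running the offline function on a sliding window for every new frame would repeat work on inputs already seen.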