LOOKING INTO ACTORS, OBJECTS AND THEIR INTERACTIONS FOR VIDEO UNDERSTANDING
Automatic video understanding is critical for enabling new applications in video surveillance, augmented reality, and beyond. Powered by deep networks that learn holistic representations of video clips and by large-scale annotated datasets, modern systems can accurately recognize hundreds of human activity classes. However, their performance degrades significantly as the number of actors in the scene or the complexity of the activities increases. As a result, most research thus far has focused on videos that are short and/or contain a few activities performed only by adults. Furthermore, most current systems require expensive spatio-temporal annotations for training. These limitations prevent the deployment of such systems in real-life applications, such as detecting the activities of people and vehicles in extended surveillance videos.
To address these limitations, this thesis focuses on developing data-driven, compositional, region-based video understanding models, motivated by the observation that actors, objects, and their spatio-temporal interactions are the building blocks of activities and the main content of video descriptions provided by humans. This thesis makes three main contributions. First, we propose a novel Graph Neural Network for representation learning on heterogeneous graphs that encode spatio-temporal interactions between actor and object regions in videos. This model can learn context-aware representations for detected actors and objects, which we leverage for detecting complex activities. Second, we propose an attention-based deep conditional generative model of sentences whose latent variables correspond to alignments between words in textual descriptions of videos and object regions. Building upon the framework of Conditional Variational Autoencoders, we train this model using only textual descriptions, without bounding box annotations, and leverage its latent variables for localizing the actors and objects that are mentioned in generated or ground-truth descriptions of videos. Finally, we propose an actor-centric framework for real-time activity detection in videos that are extended both in space and time. Our framework leverages object detection and tracking to generate actor-centric tubelets, capturing all relevant spatio-temporal context for a single actor, and detects activities per tubelet based on contextual region embeddings. The models described have demonstrably improved the ability to temporally detect activities, as well as to ground words in visual inputs.
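The first contribution builds context-aware actor representations by passing messages between actor and object regions. A minimal sketch of one such message-passing step is given below; all names, dimensions, and the scaled dot-product attention form are illustrative assumptions, not the thesis's actual architecture, which uses learned networks over detected region features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def actor_object_message_passing(actor_feats, object_feats):
    """One attention-weighted message-passing step from object nodes to
    actor nodes on a bipartite actor-object interaction graph (a toy
    stand-in for the heterogeneous-graph GNN described in the abstract).

    actor_feats:  (A, d) region embeddings of detected actors
    object_feats: (O, d) region embeddings of detected objects
    Returns (A, d) context-aware actor embeddings.
    """
    d = actor_feats.shape[1]
    # Relevance of each object to each actor (scaled dot product).
    scores = actor_feats @ object_feats.T / np.sqrt(d)   # (A, O)
    weights = softmax(scores, axis=1)
    context = weights @ object_feats                     # aggregate object messages
    # Residual update preserves each actor's own appearance feature.
    return actor_feats + context

rng = np.random.default_rng(0)
actors = rng.standard_normal((2, 8))    # 2 detected actors, dim 8
objects = rng.standard_normal((5, 8))   # 5 detected objects, dim 8
updated = actor_object_message_passing(actors, objects)
print(updated.shape)  # (2, 8)
```

In the full model, such context-aware actor embeddings would feed an activity classifier; here the update is a single fixed attention step rather than a trained layer.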
GRATIS: Deep Learning Graph Representation with Task-specific Topology and Multi-dimensional Edge Features
Graphs are a powerful representation for various types of real-world data. A
graph's topology (the presence of edges) and its edge features determine the
message-passing mechanism among the vertices within the graph. Most existing
approaches manually define a single-valued edge to describe the connectivity
or strength of association between a pair of vertices, so task-specific and
crucial relationship cues may be disregarded by such manually defined
topologies and single-valued edge features. In this paper, we propose the
first general graph representation learning framework (called GRATIS), which
can generate a strong graph representation with a task-specific topology and
task-specific multi-dimensional edge features from any arbitrary input. To
learn each edge's presence and multi-dimensional feature, our framework takes
both the corresponding vertex pair and its global contextual information into
consideration, enabling the generated graph representation to have a globally
optimal message-passing mechanism for different downstream tasks.
Experimental results on various graph analysis tasks across 11 graph and
non-graph datasets show that GRATIS not only largely enhances pre-defined
graphs but also learns a strong graph representation for non-graph data, with
clear performance improvements on all tasks. In particular, the learned
topology and multi-dimensional edge features provide complementary
task-related cues for graph analysis tasks. Our framework is effective,
robust, and flexible: it is a plug-and-play module that can be combined with
different backbones and Graph Neural Networks (GNNs) to generate a
task-specific graph representation from various graph and non-graph data. Our
code is made publicly available at
https://github.com/SSYSteve/Learning-Graph-Representation-with-Task-specific-Topology-and-Multi-dimensional-Edge-Features
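The graph construction described in the abstract (predicting each edge's presence and a multi-dimensional edge feature from the vertex pair plus global context) might be sketched as follows. Everything here is a hypothetical toy: the linear maps stand in for GRATIS's learned deep networks, and the mean of the vertices stands in for its global contextual representation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_task_graph(node_feats, w_presence, W_edge):
    """Toy GRATIS-style graph construction. For every vertex pair (i, j),
    concatenate the two vertex features with a global context vector
    (here simply the mean of all vertices) and predict
    (a) an edge-presence probability, defining the topology, and
    (b) a multi-dimensional edge feature.
    """
    n, d = node_feats.shape
    g = node_feats.mean(axis=0)                  # global context (stand-in)
    d_edge = W_edge.shape[0]
    adj = np.zeros((n, n))                       # soft topology
    edge_feats = np.zeros((n, n, d_edge))        # multi-dim edge features
    for i in range(n):
        for j in range(n):
            pair = np.concatenate([node_feats[i], node_feats[j], g])  # (3d,)
            adj[i, j] = sigmoid(pair @ w_presence)      # edge presence prob.
            edge_feats[i, j] = np.tanh(W_edge @ pair)   # edge feature
    return adj, edge_feats

rng = np.random.default_rng(1)
nodes = rng.standard_normal((4, 6))              # 4 vertices, dim 6
w_p = rng.standard_normal(3 * 6)                 # presence predictor weights
W_e = rng.standard_normal((5, 3 * 6))            # 5-dim edge-feature weights
adj, ef = build_task_graph(nodes, w_p, W_e)
print(adj.shape, ef.shape)  # (4, 4) (4, 4, 5)
```

A downstream GNN would then message-pass over this generated topology using the edge features; in the real framework both predictors are trained end-to-end with the task loss, which is what makes the topology and edge features task-specific.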