High-for-Low and Low-for-High: Efficient Boundary Detection from Deep Object Features and its Applications to High-Level Vision
Most current boundary detection systems rely exclusively on low-level
features, such as color and texture. However, perception studies suggest that
humans employ object-level reasoning when judging if a particular pixel is a
boundary. Inspired by this observation, in this work we show how to predict
boundaries by exploiting object-level features from a pretrained
object-classification network. Our method can be viewed as a "High-for-Low"
approach where high-level object features inform the low-level boundary
detection process. Our model achieves state-of-the-art performance on an
established boundary detection benchmark and is efficient to run.
Additionally, we show that, due to their semantic nature, our boundaries can be
used to aid a number of high-level vision tasks. We demonstrate that our
boundaries improve the performance of state-of-the-art methods on semantic
boundary labeling, semantic segmentation, and object proposal generation. We
can view this process as a "Low-for-High" scheme, where low-level boundaries
aid high-level vision tasks.
Thus, our contributions include a boundary detection system that is accurate
and efficient, generalizes well to multiple datasets, and is shown to improve
existing state-of-the-art high-level vision methods on three distinct tasks.
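As a rough illustration of the "High-for-Low" idea, the sketch below taps
feature maps from a pretrained image-classification network and trains only a
small head on top of them to predict per-pixel boundary probabilities. The
choice of backbone (torchvision's ResNet-50), the tapped stages, and the head
are assumptions made for illustration, not the exact architecture used in the
paper.

```python
# Illustrative "High-for-Low" sketch: high-level features from a pretrained
# classification network drive low-level boundary prediction.
# Backbone, tapped stages, and head are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class HighForLowBoundaries(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # Freeze the object-classification backbone; only the head is trained.
        for p in self.parameters():
            p.requires_grad = False
        # Fuse multi-stage object features into a single boundary map.
        self.head = nn.Conv2d(256 + 512 + 1024 + 2048, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats, out = [], self.stem(x)
        for stage in self.stages:
            out = stage(out)
            # Upsample every stage to the input resolution before fusion.
            feats.append(F.interpolate(out, size=(h, w), mode="bilinear",
                                       align_corners=False))
        return torch.sigmoid(self.head(torch.cat(feats, dim=1)))  # boundary probabilities

# Example: boundary_map = HighForLowBoundaries()(torch.randn(1, 3, 224, 224))
```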
Embodied Visual Perception Models For Human Behavior Understanding
Many modern applications require extracting the core attributes of human behavior, such as a person's attention, intent, or skill level, from visual data. There are two main challenges related to this problem. First, we need models that can represent visual data in terms of object-level cues. Second, we need models that can infer the core behavioral attributes from the visual data. We refer to these two challenges as "learning to see" and "seeing to learn", respectively. In this PhD thesis, we have made progress towards addressing both challenges.
We tackle the problem of "learning to see" by developing methods that extract object-level information directly from raw visual data. This includes two top-down contour detectors, DeepEdge and HfL, which can be used to aid high-level vision tasks such as object detection. Furthermore, we also present two semantic object segmentation methods, Boundary Neural Fields (BNFs) and Convolutional Random Walk Networks (RWNs), which integrate low-level affinity cues into the object segmentation process. We then shift our focus to video-level understanding and present a Spatiotemporal Sampling Network (STSN), which can be used for video object detection and discriminative motion feature learning.
Afterwards, we transition to the second subproblem, "seeing to learn", for which we leverage first-person GoPro cameras that record what people see during a particular activity. We aim to infer core behavioral attributes such as a person's attention, intention, and skill level from such first-person data. To do so, we first propose the concept of action-objects: the objects that capture a person's conscious visual (e.g., watching a TV) or tactile (e.g., taking a cup) interactions. We then introduce two models, EgoNet and Visual-Spatial Network (VSN), which detect action-objects in supervised and unsupervised settings, respectively. Afterwards, we focus on a behavior understanding task in a complex basketball activity. We present a method for evaluating players' skill level from their first-person basketball videos, and also a model that predicts a player's future motion trajectory from a single first-person image.
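The affinity-based refinement idea behind BNFs and RWNs can be illustrated with
a minimal sketch: per-pixel segmentation scores are propagated by a random walk
whose transition probabilities come from low-level pairwise affinities. The
code below is only a conceptual illustration under assumed shapes and an
assumed Gaussian affinity; it is not the thesis's exact formulation.

```python
# Conceptual sketch of affinity-driven refinement: segmentation scores are
# propagated by a random walk whose transitions come from low-level pairwise
# affinities between neighbouring pixels. Shapes and the Gaussian affinity
# are assumptions; this is not the thesis's exact formulation.
import torch
import torch.nn.functional as F

def random_walk_refine(seg_logits, low_level_feats, steps=3, sigma=0.5):
    """seg_logits: (C, H, W) class scores; low_level_feats: (D, H, W) cues (e.g. color)."""
    c, h, w = seg_logits.shape
    n = h * w
    f = low_level_feats.reshape(low_level_feats.shape[0], n).t()   # (N, D)
    # Gaussian affinities from feature similarity (dense; fine for small H*W).
    affinity = torch.exp(-torch.cdist(f, f).pow(2) / (2 * sigma ** 2))
    # Keep only 4-connected neighbours plus self-loops so the walk stays local.
    idx = torch.arange(n).reshape(h, w)
    mask = torch.zeros(n, n)
    mask[idx[:, :-1].reshape(-1), idx[:, 1:].reshape(-1)] = 1      # right neighbours
    mask[idx[:-1, :].reshape(-1), idx[1:, :].reshape(-1)] = 1      # down neighbours
    mask = mask + mask.t() + torch.eye(n)
    transition = F.normalize(affinity * mask, p=1, dim=1)          # row-stochastic matrix
    probs = F.softmax(seg_logits, dim=0).reshape(c, n).t()         # (N, C)
    for _ in range(steps):
        probs = transition @ probs                                 # one random-walk step
    return probs.t().reshape(c, h, w)

# Example: refined = random_walk_refine(torch.randn(21, 16, 16), torch.rand(3, 16, 16))
```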
TALLFormer: Temporal Action Localization with Long-memory Transformer
Most modern approaches in temporal action localization divide this problem
into two parts: (i) short-term feature extraction and (ii) long-range temporal
boundary localization. Due to the high GPU memory cost caused by processing
long untrimmed videos, many methods sacrifice the representational power of the
short-term feature extractor by either freezing the backbone or using a very
small spatial video resolution. This issue becomes even worse with the recent
video transformer models, many of which have quadratic memory complexity. To
address these issues, we propose TALLFormer, a memory-efficient and end-to-end
trainable Temporal Action Localization transformer with Long-term memory. Our
long-term memory mechanism eliminates the need to process hundreds of
redundant video frames during each training iteration, thus significantly
reducing GPU memory consumption and training time. These efficiency savings
allow us to (i) use a powerful video transformer-based feature extractor
without freezing the backbone or reducing the spatial video resolution, while
(ii) maintaining long-range temporal boundary localization capability.
With only RGB frames as input and no external action recognition classifier,
TALLFormer outperforms previous state-of-the-art methods by a large margin,
achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The
code will be available at https://github.com/klauscc/TALLFormer.
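The long-memory idea can be sketched conceptually as follows: each training
iteration runs the heavy backbone on only a small subset of frames, while
features for the remaining frames are read from a cached memory bank that is
refreshed as frames get re-sampled. The code below is a simplified
illustration under assumed shapes and an assumed update rule, not TALLFormer's
actual implementation.

```python
# Simplified sketch of a long-memory training step: the heavy backbone runs on
# only a few sampled frames per iteration, and features for the remaining
# frames come from a cached memory bank. Names, shapes, and the update rule
# are assumptions, not TALLFormer's actual implementation.
import torch
import torch.nn as nn

class LongMemoryExtractor(nn.Module):
    def __init__(self, backbone: nn.Module, num_frames: int, feat_dim: int,
                 frames_per_step: int = 8):
        super().__init__()
        self.backbone = backbone                  # heavy short-term feature extractor
        self.frames_per_step = frames_per_step
        # Per-frame feature cache, kept outside the autograd graph.
        self.register_buffer("memory", torch.zeros(num_frames, feat_dim))

    def forward(self, frames):
        """frames: (T, C, H, W) untrimmed video; returns (T, D) per-frame features."""
        t = frames.shape[0]
        # Sample a small subset of frames to process with gradients this iteration.
        sampled = torch.randperm(t)[: self.frames_per_step]
        fresh = self.backbone(frames[sampled])    # (frames_per_step, D)
        # Refresh the cache with detached copies of the newly computed features.
        self.memory[sampled] = fresh.detach()
        # Long-range features: cached everywhere, differentiable where re-computed.
        feats = self.memory.clone()
        feats[sampled] = fresh
        return feats                              # fed to the temporal localization head
```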