82,398 research outputs found
Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks
Shot boundary detection (SBD) is an important component of many video
analysis tasks, such as action recognition, video indexing, summarization and
editing. Previous work typically used a combination of low-level features like
color histograms, in conjunction with simple models such as SVMs. Instead, we
propose to learn shot detection end-to-end, from pixels to final shot
boundaries. For training such a model, we rely on our insight that all shot
boundaries are generated. Thus, we create a dataset with one million frames and
automatically generated transitions such as cuts, dissolves and fades. In order
to efficiently analyze hours of videos, we propose a Convolutional Neural
Network (CNN) which is fully convolutional in time, thus allowing to use a
large temporal context without the need to repeatedly processing frames. With
this architecture our method obtains state-of-the-art results while running at
an unprecedented speed of more than 120x real-time
Single Shot Temporal Action Detection
Temporal action detection is a very important yet challenging problem, since
videos in real applications are usually long, untrimmed and contain multiple
action instances. This problem requires not only recognizing action categories
but also detecting start time and end time of each action instance. Many
state-of-the-art methods adopt the "detection by classification" framework:
first do proposal, and then classify proposals. The main drawback of this
framework is that the boundaries of action instance proposals have been fixed
during the classification step. To address this issue, we propose a novel
Single Shot Action Detector (SSAD) network based on 1D temporal convolutional
layers to skip the proposal generation step via directly detecting action
instances in untrimmed video. On pursuit of designing a particular SSAD network
that can work effectively for temporal action detection, we empirically search
for the best network architecture of SSAD due to lacking existing models that
can be directly adopted. Moreover, we investigate into input feature types and
fusion strategies to further improve detection accuracy. We conduct extensive
experiments on two challenging datasets: THUMOS 2014 and MEXaction2. When
setting Intersection-over-Union threshold to 0.5 during evaluation, SSAD
significantly outperforms other state-of-the-art systems by increasing mAP from
19.0% to 24.6% on THUMOS 2014 and from 7.4% to 11.0% on MEXaction2.Comment: ACM Multimedia 201
Temporal Action Localization with Enhanced Instant Discriminability
Temporal action detection (TAD) aims to detect all action boundaries and
their corresponding categories in an untrimmed video. The unclear boundaries of
actions in videos often result in imprecise predictions of action boundaries by
existing methods. To resolve this issue, we propose a one-stage framework named
TriDet. First, we propose a Trident-head to model the action boundary via an
estimated relative probability distribution around the boundary. Then, we
analyze the rank-loss problem (i.e. instant discriminability deterioration) in
transformer-based methods and propose an efficient scalable-granularity
perception (SGP) layer to mitigate this issue. To further push the limit of
instant discriminability in the video backbone, we leverage the strong
representation capability of pretrained large models and investigate their
performance on TAD. Last, considering the adequate spatial-temporal context for
classification, we design a decoupled feature pyramid network with separate
feature pyramids to incorporate rich spatial context from the large model for
localization. Experimental results demonstrate the robustness of TriDet and its
state-of-the-art performance on multiple TAD datasets, including hierarchical
(multilabel) TAD datasets.Comment: An extended version of the CVPR paper arXiv:2303.07347, submitted to
IJC
Recommended from our members
Deep Learning for Action Understanding in Video
Action understanding is key to automatically analyzing video content and thus is important for many real-world applications such as autonomous driving car, robot-assisted care, etc. Therefore, in the computer vision field, action understanding has been one of the fundamental research topics. Most conventional methods for action understanding are based on hand-crafted features. Like the recent advances seen in image classification, object detection, image captioning, etc, deep learning has become a popular approach for action understanding in video. However, there remain several important research challenges in developing deep learning based methods for understanding actions. This thesis focuses on the development of effective deep learning methods for solving three major challenges.
Action detection at fine granularities in time: Previous work in deep learning based action understanding mainly focuses on exploring various backbone networks that are designed for the video-level action classification task. These did not explore the fine-grained temporal characteristics and thus failed to produce temporally precise estimation of action boundaries. In order to understand actions more comprehensively, it is important to detect actions at finer granularities in time. In Part I, we study both segment-level action detection and frame-level action detection. Segment-level action detection is usually formulated as the temporal action localization task, which requires not only recognizing action categories for the whole video but also localizing the start time and end time of each action instance. To this end, we propose an effective multi-stage framework called Segment-CNN consisting of three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments that may contain actions; (2) a classification network learns one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. In another approach, frame-level action detection is effectively formulated as the per-frame action labeling task. We combine two reverse operations (i.e. convolution and deconvolution) into a joint Convolutional-De-Convolutional (CDC) filter, which simultaneously conducts downsampling in space and upsampling in time to jointly model both high-level semantics and temporal dynamics. We design a novel CDC network to predict actions at frame-level and the frame-level predictions can be further used to detect precise segment boundary for the temporal action localization task. Our method not only improves the state-of-the-art mean Average Precision (mAP) result on THUMOS’14 from 41.3% to 44.4% for the per-frame labeling task, but also improves mAP for the temporal action localization task from 19.0% to 23.3% on THUMOS’14 and from 16.4% to 23.8% on ActivityNet v1.3.
Action detection in the constrained scenarios: The usual training process of deep learning models consists of supervision and data, which are not always available in reality. In Part II, we consider the scenarios of incomplete supervision and incomplete data. For incomplete supervision, we focus on the weakly-supervised temporal action localization task and propose AutoLoc which is the first framework that can directly predict the temporal boundary of each action instance with only the video-level annotations available during training. To enable the training of such a boundary prediction model, we design a novel Outer-Inner-Contrastive (OIC) loss to help discover the segment-level supervision and we prove that the OIC loss is differentiable to the underlying boundary prediction model. Our method significantly improves mAP on THUMOS14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. For the scenario of incomplete data, we formulate a novel task called Online Detection of Action Start (ODAS) in streaming videos to enable detecting the action start time on the fly when a live video action is just starting. ODAS is important in many applications such as early alert generation to allow timely security or emergency response. Specifically, we propose three novel methods to address the challenges in training ODAS models: (1) hard negative samples generation based on Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicitly modeling the temporal consistency between data around action start and data succeeding action start, and (3) adaptive sampling strategy to handle the scarcity of training data.
Action understanding in the compressed domain: The mainstream action understanding methods including the aforementioned techniques developed by us require first decoding the compressed video into RGB image frames. This may result in significant cost in terms of storage and computation. Recently, researchers started to investigate how to directly perform action understanding in the compressed domain in order to achieve high efficiency while maintaining the state-of-the-art action detection accuracy. The key research challenge is developing effective backbone networks that can directly take data in the compressed domain as input. Our baseline is to take models developed for action understanding in the decoded domain and adapt them to attack the same tasks in the compressed domain. In Part III, we address two important issues in developing the backbone networks that exclusively operate in the compressed domain. First, compressed videos may be produced by different encoders or encoding parameters, but it is impractical to train a different compressed-domain action understanding model for each different format. We experimentally analyze the effect of video encoder variation and develop a simple yet effective training data preparation method to alleviate the sensitivity to encoder variation. Second, motion cues have been shown to be important for action understanding, but the motion vectors in compressed video are often very noisy and not discriminative enough for directly performing accurate action understanding. We develop a novel and highly efficient framework called DMC-Net that can learn to predict discriminative motion cues based on noisy motion vectors and residual errors in the compressed video streams. On three action recognition benchmarks, namely HMDB-51, UCF101 and a subset of Kinetics, we demonstrate that our DMC-Net can significantly shorten the performance gap between state-of-the-art compressed video based methods with and without optical flow, while being two orders of magnitude faster than the methods that use optical flow.
By addressing the three major challenges mentioned above, we are able to develop more robust models for video action understanding and improve performance in various dimensions, such as (1) temporal precision, (2) required levels of supervision, (3) live video analysis ability, and finally (4) efficiency in processing compressed video. Our research has contributed significantly to advancing the state of the art of video action understanding and expanding the foundation for comprehensive semantic understanding of video content
- …