53,344 research outputs found
Point-wise mutual information-based video segmentation with high temporal consistency
In this paper, we tackle the problem of temporally consistent boundary
detection and hierarchical segmentation in videos. While high-level reasoning
about region assignments in videos is the focus of much recent research,
temporal consistency in boundary detection has so far only rarely been
tackled. We argue that temporally consistent boundaries are a key
component to temporally consistent region assignment. The proposed method is
based on the point-wise mutual information (PMI) of spatio-temporal voxels.
Temporal consistency is established by an evaluation of PMI-based point
affinities in the spectral domain over space and time. Thus, the proposed
method is independent of any optical flow computation or previously learned
motion models. The proposed low-level video segmentation method outperforms the
learning-based state of the art in terms of standard region metrics.
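As a rough illustration of the idea (a sketch, not the authors' implementation; the density estimates, the exponent rho, and all names here are assumptions), PMI-based affinities between spatio-temporally adjacent voxels can be embedded spectrally as follows:

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

def pmi(p_joint, p_a, p_b, rho=1.25, eps=1e-12):
    # PMI_rho(A, B) = log(P(A, B)**rho / (P(A) * P(B)));
    # rho > 1 down-weights feature pairs that co-occur only rarely
    # (the value used here is illustrative, not from the paper).
    return rho * np.log(p_joint + eps) - np.log(p_a * p_b + eps)

def spectral_embedding(W, k=8):
    # W: sparse symmetric (voxels x voxels) affinity matrix whose entries
    # derive from PMI between spatio-temporally adjacent voxels, so the
    # eigenvectors are coupled over space AND time, with no optical flow.
    d = np.asarray(W.sum(axis=1)).ravel()
    D = diags(d)
    _, vecs = eigsh(D - W, k=k, M=D, which="SM")
    return vecs  # per-voxel embedding; its spatial gradients act as soft boundaries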
Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS Instance Segmentation
This paper presents a deep learning framework for medical video segmentation.
Convolutional neural network (CNN) and transformer-based methods have reached
major milestones in medical image segmentation thanks to their strong
semantic feature encoding and global context modelling abilities.
However, most existing approaches ignore a salient aspect of medical video data
- the temporal dimension. Our proposed framework explicitly extracts features
from neighbouring frames across the temporal dimension and incorporates them
with a temporal feature blender, which then tokenises the high-level
spatio-temporal features to form a strong global representation that is
encoded by a Swin Transformer. The final segmentation results are produced via a UNet-like
encoder-decoder architecture. Our model outperforms other approaches by a
significant margin and improves the segmentation benchmarks on the VFSS2022
dataset, achieving Dice coefficients of 0.8986 and 0.8186 on the two datasets
tested. Our studies also show the efficacy of the temporal feature blending
scheme and cross-dataset transferability of learned capabilities. Code and
models are fully available at https://github.com/SimonZeng7108/Video-SwinUNet
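A minimal PyTorch sketch of what such a temporal feature blender could look like (the window size, the learnable per-frame weighting, and the class name are assumptions; the released code linked above is the authoritative implementation):

import torch
import torch.nn as nn

class TemporalFeatureBlender(nn.Module):
    # Fuses per-frame CNN features from a window of neighbouring frames
    # into one spatio-temporal map before tokenisation for the Swin encoder.
    def __init__(self, channels, window=5):
        super().__init__()
        self.frame_weights = nn.Parameter(torch.ones(window) / window)
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):  # feats: (B, T, C, H, W) with T == window
        w = torch.softmax(self.frame_weights, dim=0)
        blended = (feats * w.view(1, -1, 1, 1, 1)).sum(dim=1)  # (B, C, H, W)
        return self.mix(blended)

The blended map would then be split into patch tokens, encoded by the Swin Transformer, and decoded by the UNet-like decoder into segmentation masks.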
Time-Space Transformers for Video Panoptic Segmentation
We propose a novel solution for the task of video panoptic segmentation that
simultaneously predicts pixel-level semantic and instance segmentation and
generates clip-level instance tracks. Our network, named VPS-Transformer, is
a hybrid architecture built on the state-of-the-art panoptic segmentation
network Panoptic-DeepLab: it combines a convolutional branch for
single-frame panoptic segmentation with a novel video module based on an
instantiation of the pure Transformer block. The Transformer, equipped with
attention mechanisms, models spatio-temporal relations between backbone output
features of current and past frames for more accurate and consistent panoptic
estimates. As the pure Transformer block introduces a large computational
overhead when processing high-resolution images, we propose a few design
changes for more efficient computation. We study how to aggregate information more effectively
over the space-time volume and we compare several variants of the Transformer
block with different attention schemes. Extensive experiments on the
Cityscapes-VPS dataset demonstrate that our best model improves the temporal
consistency and video panoptic quality by a margin of 2.2%, with little extra
computation.
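A hedged sketch of the kind of space-time attention the abstract describes (not the paper's exact block; names and shapes are assumptions): current-frame tokens attend over tokens of the current and past frames.

import torch
import torch.nn as nn

class SpaceTimeAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur, past):
        # cur:  (B, N, C) backbone tokens of the current frame
        # past: (B, T*N, C) tokens of past frames; downsampling them is one
        # way to tame the quadratic cost the abstract mentions
        memory = torch.cat([cur, past], dim=1)
        out, _ = self.attn(query=cur, key=memory, value=memory)
        return self.norm(cur + out)  # refined, temporally consistent features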
An investigation into image and video foreground segmentation and change detection
Detecting and segmenting spatio-temporal foreground objects in videos is significant for motion pattern modelling and video content analysis, and extensive efforts have been made over the past decades. Nevertheless, video-based saliency detection and foreground segmentation remain challenging. On the one hand, the performance of image-based saliency detection algorithms is limited on complex content, and the temporal connectivity between frames is not well resolved. On the other hand, compared with the prosperous image-based datasets, datasets for video-level saliency detection and segmentation are usually smaller in scale and less diverse in content. Towards a better understanding of video-level semantics, this thesis investigates foreground estimation and segmentation at both the image and video levels.
First, this thesis demonstrates the effectiveness of traditional features in video foreground estimation and segmentation. Motion patterns obtained from optical flow are used to draw coarse estimates of the foreground objects. These coarse estimates are refined by aligning motion boundaries with the actual contours of the foreground objects with the aid of HOG descriptors, and a precise segmentation of the foreground is then computed from the refined estimates and the video-level colour distribution.
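A minimal OpenCV sketch of the coarse first stage under stated assumptions (Farneback dense flow and a simple magnitude threshold; the HOG-based boundary refinement and the colour-distribution segmentation are omitted):

import cv2
import numpy as np

def coarse_foreground(prev_gray, gray, mag_thresh=1.0):
    # Dense optical flow between consecutive grayscale frames ...
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # ... thresholded on magnitude to get a coarse foreground mask.
    mag = np.linalg.norm(flow, axis=2)
    return (mag > mag_thresh).astype(np.uint8)  # 1 = moving foreground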
Second, a deep convolutional neural network (CNN) for image saliency detection, named HReSNet, is proposed. To improve the accuracy of saliency prediction, an independent feature-refining network is implemented, and a Euclidean distance loss is integrated into the loss computation to sharpen the saliency predictions near object contours. Experimental results demonstrate that the network obtains competitive results compared with state-of-the-art algorithms.
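The abstract leaves the exact form of this loss open; one plausible reading, sketched here under that assumption, re-weights a pixel-wise loss by each pixel's Euclidean distance to the ground-truth contour (function name, weighting scheme, and sigma are all hypothetical):

import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def contour_weighted_loss(pred, target, sigma=10.0):
    # pred, target: (B, 1, H, W); pred holds post-sigmoid saliency maps.
    t = target.cpu().numpy().astype(bool)
    # edt(m) + edt(~m) gives each pixel's distance to the mask boundary
    dist = np.stack([distance_transform_edt(m) + distance_transform_edt(~m)
                     for m in t[:, 0]])
    weight = torch.from_numpy(np.exp(-dist / sigma)).to(pred.device).float()
    bce = F.binary_cross_entropy(pred, target, reduction="none")[:, 0]
    return (weight * bce).mean()  # emphasises pixels near object contours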
Third, a large-scale dataset for video saliency detection and foreground segmentation is built to enrich the diversity of current video-based foreground segmentation datasets. A supervised framework is also proposed as the baseline, integrating HReSNet, Long Short-Term Memory (LSTM) networks and a hierarchical segmentation network.
Fourth, practical change detection requires distinguishing expected, semantically meaningful changes from unexpected ones. A new CNN design is therefore proposed to detect changes in multi-temporal high-resolution urban images. Experimental results show that the change detection network outperforms competing algorithms by significant margins.
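For flavour only (a generic illustration, not the thesis' actual design), CNN change detection is often realised as a Siamese encoder whose per-pixel feature difference is classified as change or no-change:

import torch
import torch.nn as nn

class SiameseChangeNet(nn.Module):
    def __init__(self, in_ch=3, feat=32):
        super().__init__()
        # shared encoder applied to both acquisition dates
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(feat, 1, 1)  # per-pixel change logit

    def forward(self, img_t0, img_t1):
        f0, f1 = self.encoder(img_t0), self.encoder(img_t1)
        return self.head(torch.abs(f0 - f1))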
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
We address the task of supervised action segmentation which aims to partition
a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the
frame level, which incurs a high computational cost and cannot well capture
action dependencies over long temporal horizons. To address these issues, we
propose an efficient BI-level Temporal modeling (BIT) framework that learns
explicit action tokens to represent action segments and, in parallel, performs
temporal modeling at the frame and action levels, while maintaining a low
computational cost. Our model contains (i) a frame branch that uses convolution
to learn frame-level relationships, (ii) an action branch that uses a transformer
to learn action-level dependencies with a small set of action tokens and (iii)
cross-attentions to allow communication between the two branches. We apply and
extend a set-prediction objective to allow each action token to represent one
or multiple action segments, which avoids learning a large number of tokens
over long videos with many segments. Thanks to the design of our action branch,
we can also seamlessly leverage textual transcripts of videos (when available)
to help action segmentation by using them to initialize the action tokens. We
evaluate our model on four video datasets (two egocentric and two third-person)
for action segmentation with and without transcripts, showing that BIT
significantly improves the state-of-the-art accuracy with much lower
computational cost (30 times faster) compared to existing transformer-based
methods.
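An illustrative PyTorch sketch of the bi-level idea (class and parameter names are assumptions, not the authors' code): a convolutional frame branch, a small set of learned action tokens, and cross-attention linking the two.

import torch
import torch.nn as nn

class BiLevelBlock(nn.Module):
    def __init__(self, dim, num_tokens=32, heads=4):
        super().__init__()
        self.frame_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.action_tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames):  # frames: (B, T, C) per-frame features
        # frame branch: cheap convolution over time
        f = self.frame_conv(frames.transpose(1, 2)).transpose(1, 2)
        a = self.action_tokens.unsqueeze(0).expand(frames.size(0), -1, -1)
        # action tokens gather evidence from frames (cross-attention) ...
        a, _ = self.cross_attn(query=a, key=f, value=f)
        # ... then model long-horizon dependencies among themselves
        a, _ = self.token_attn(query=a, key=a, value=a)
        return f, a  # frame features and action-segment tokens

When transcripts are available, the randomly initialised action_tokens could instead be seeded from text embeddings of the transcript, as the abstract suggests.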
A Deep Motion Vector Approach to Video Object Segmentation
Video object segmentation has been gaining research and commercial importance in recent times, from checkout-free Amazon Go stores to autonomous vehicles operating on roads. Efficient operation in such use cases requires segmentation inference in real time. Even though there has been significant research in image segmentation, both semantic and instance, there is still much scope for improvement in video segmentation. Video segmentation is a direct extension of image segmentation, except that there is a temporal relation between neighbouring frames of a video. Exploiting this temporal relation efficiently is one of the most important challenges in video segmentation. The temporal relation involves a great deal of redundancy, which many of the prevalent state-of-the-art techniques do not exploit.
Optical flow is one approach to exploiting temporal redundancy: intermediate feature maps of previous frames are interpolated using the flow, and the rest of the segmentation operation proceeds from them. However, optical flow resolves motion at the pixel level, and there is rarely enough motion between consecutive frames to warrant pixel-level estimation. Instead, we can divide a frame into multiple blocks and estimate the movement of their centroids across consecutive video frames, as sketched below.
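A small sketch of one way to realise the block-centroid idea, using Lucas-Kanade point tracking (the actual motion-vector source could differ, e.g. vectors reused from the video codec; names and the block size are assumptions):

import cv2
import numpy as np

def block_motion_vectors(prev_gray, gray, block=16):
    # one tracking point at the centroid of every block x block tile
    h, w = prev_gray.shape
    ys, xs = np.mgrid[block // 2:h:block, block // 2:w:block]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    pts = pts.reshape(-1, 1, 2)
    # estimate where each centroid moved in the next frame
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    return (new_pts - pts).reshape(-1, 2), status.ravel()

Aggregate statistics of these per-block vectors (e.g. their total magnitude) are also a natural signal for deciding when to re-run full inference on a new keyframe.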
Based on this idea, we present a motion vector approach to video semantic segmentation. Additionally, we propose an adaptive technique to select keyframes during inference. We show that the proposed algorithm can bring down the computational complexity during inference by as much as 50% with only a 2-3% drop in the accuracy metric. The algorithm can operate at up to 136 frames per second, indicating that it can easily handle real-time inference.