OVSNet: Towards One-Pass Real-Time Video Object Segmentation
Video object segmentation aims at accurately segmenting the target object regions across consecutive frames. It is technically challenging because of complicating factors such as shape deformation, occlusion, and objects moving out of view. Recent approaches largely address these factors with back-and-forth re-identification and bi-directional mask propagation. However, these methods are extremely slow and support only offline inference, so in principle they cannot be applied in real time. Motivated by this observation, we propose an efficient detection-based paradigm for video object segmentation: a unified One-Pass Video Segmentation framework (OVS-Net) that models spatial-temporal representations in a single pipeline, seamlessly integrating object detection, object segmentation, and object re-identification. The proposed framework lends itself to one-pass inference that performs video object segmentation both effectively and efficiently. Moreover, we propose a mask-guided attention module for modeling multi-scale object boundaries and multi-level feature fusion. Experiments on the challenging DAVIS 2017 benchmark demonstrate the effectiveness of the proposed framework: it achieves performance comparable to the state of the art while running at about 11.5 FPS, more than 5 times faster than other state-of-the-art methods and, to our knowledge, a pioneering step towards real-time video object segmentation.
Comment: 10 pages, 6 figures
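The abstract does not spell out the module's internals, but a mask-guided attention block of this kind is commonly built by turning a (downsampled) predicted mask into a spatial attention map that re-weights backbone features before fusion. Below is a minimal PyTorch sketch under that assumption; the module name, channel sizes, and the residual gating scheme are illustrative, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    """Illustrative mask-guided attention: a coarse object mask is
    projected to per-channel spatial gates that re-weight features."""

    def __init__(self, channels: int):
        super().__init__()
        # 1-channel mask -> per-pixel, per-channel attention logits
        self.mask_proj = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Resize the mask to the feature resolution (multi-scale use).
        mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear",
                             align_corners=False)
        attn = torch.sigmoid(self.mask_proj(mask))  # (B, C, H, W) gates
        # Residual gating keeps unattended regions from being zeroed out.
        return self.fuse(feat * attn + feat)

# Usage: gate stride-8 backbone features with a previous-frame mask.
feats = torch.randn(2, 256, 32, 32)
prev_mask = torch.rand(2, 1, 128, 128)
print(MaskGuidedAttention(256)(feats, prev_mask).shape)  # (2, 256, 32, 32)
```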
MediViSTA-SAM: Zero-shot Medical Video Analysis with Spatio-temporal SAM Adaptation
In recent years, the Segment Anything Model (SAM) has attracted considerable attention as a foundation model well known for its robust generalization capabilities across various downstream tasks. However, SAM does not exhibit satisfactory performance in the realm of medical image analysis. In this study, we introduce MediViSTA-SAM, a novel approach designed for medical video segmentation and, to our knowledge, the first study to adapt SAM to video segmentation. Given video data, the MediViSTA spatio-temporal adapter captures long- and short-range temporal attention with a cross-frame attention mechanism, effectively constraining each frame to use the immediately preceding frame as a reference while also modeling spatial information. Additionally, it incorporates multi-scale fusion by employing a U-shaped encoder and a modified mask decoder to handle objects of varying sizes. To evaluate our approach, we conducted extensive experiments against state-of-the-art (SOTA) methods, assessing its generalization abilities on multi-vendor in-house echocardiography datasets. The results highlight the accuracy and effectiveness of our network in medical video segmentation.
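The abstract leaves the adapter's mechanics implicit; one common way to realize cross-frame attention that conditions each frame on its immediate predecessor is to draw queries from the current frame and keys/values from the previous one. The PyTorch sketch below follows that assumption; the class name, token layout, and head count are hypothetical, not MediViSTA-SAM's actual implementation.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Illustrative cross-frame attention: queries come from the current
    frame, keys/values from the immediately preceding frame, so each frame
    is explicitly conditioned on its predecessor."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, N, D) -- T frames, N patch tokens per frame.
        B, T, N, D = tokens.shape
        # Frame 0 has no predecessor; let it attend to itself.
        prev = torch.cat([tokens[:, :1], tokens[:, :-1]], dim=1)
        q = tokens.reshape(B * T, N, D)
        kv = prev.reshape(B * T, N, D)
        out, _ = self.attn(q, kv, kv)
        return self.norm(tokens + out.reshape(B, T, N, D))  # residual adapter

# Usage on a short clip of 4 frames with 196 tokens each.
clip = torch.randn(2, 4, 196, 256)
print(CrossFrameAttention(256)(clip).shape)  # torch.Size([2, 4, 196, 256])
```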
Experiments on lecturer segmentation using texture classification and a 3D camera
In our system for recording and transmitting lectures over the Internet, the board content is sent as vector graphics, yielding a high-quality image, while
the video of the lecturer is sent as a separate stream. It is easy for the
viewer to read the board, but the lecturer appears in a separate window. To
eliminate this problem, we segment the lecturer from the video stream and
paste his image on the board image at video stream rates. The lecturer can be
dimmed by the remote viewer from opaque to semitransparent, or even
transparent. This paper explains the two techniques we apply to achieve this: texture-classification-based segmentation, and segmentation using a novel 3D camera based on the time-of-flight of backscattered light. We argue that this approach provides a solution to the divided-attention problem which arises when board and lecturer are transmitted in two different streams.
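The paper's pipeline is not reproduced here, but the final compositing step it describes (pasting the segmented lecturer onto the board with viewer-adjustable opacity) amounts to alpha blending over a segmentation mask. A small NumPy sketch, where the depth threshold, default opacity, and function name are illustrative assumptions:

```python
import numpy as np

def composite_lecturer(board: np.ndarray, frame: np.ndarray,
                       depth: np.ndarray, max_depth_m: float = 2.0,
                       opacity: float = 0.6) -> np.ndarray:
    """Blend the segmented lecturer onto the board image.

    board, frame: (H, W, 3) uint8 images; depth: (H, W) metres from a
    time-of-flight camera. Pixels closer than max_depth_m are treated as
    lecturer; `opacity` is the viewer-controlled dimming (0 = transparent,
    1 = opaque). Threshold and defaults are illustrative.
    """
    mask = (depth < max_depth_m).astype(np.float32)[..., None]  # (H, W, 1)
    alpha = opacity * mask
    out = alpha * frame.astype(np.float32) + (1.0 - alpha) * board.astype(np.float32)
    return out.astype(np.uint8)

# Usage with synthetic data.
board = np.full((480, 640, 3), 255, np.uint8)   # white board image
frame = np.zeros((480, 640, 3), np.uint8)       # camera frame
depth = np.full((480, 640), 3.0, np.float32)
depth[100:400, 200:440] = 1.5                   # lecturer region
print(composite_lecturer(board, frame, depth).shape)  # (480, 640, 3)
```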
Robotic Scene Segmentation with Memory Network for Runtime Surgical Context Inference
Surgical context inference has recently garnered significant attention in
robot-assisted surgery as it can facilitate workflow analysis, skill
assessment, and error detection. However, runtime context inference is
challenging since it requires timely and accurate detection of the interactions
among the tools and objects in the surgical scene based on the segmentation of
video data. However, existing state-of-the-art video segmentation
methods are often biased against infrequent classes and fail to provide
temporal consistency for segmented masks. This can negatively impact the
context inference and accurate detection of critical states. In this study, we
propose a solution to these challenges using a Space Time Correspondence
Network (STCN). STCN is a memory network that performs binary segmentation and
minimizes the effects of class imbalance. The use of a memory bank in STCN
allows for the utilization of past image and segmentation information, thereby
ensuring consistency of the masks. Our experiments using the publicly available
JIGSAWS dataset demonstrate that STCN achieves superior segmentation
performance for objects that are difficult to segment, such as needle and
thread, and improves context inference compared to the state-of-the-art. We
also demonstrate that segmentation and context inference can be performed at
runtime without compromising performance.
Comment: accepted at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 202
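The core operation behind the memory bank described above is a memory read: keys computed from the query frame are compared against keys stored for past frames, and the resulting affinities retrieve stored mask features. The STCN paper uses a negative squared L2 affinity; the simplified sketch below follows that formulation, with all dimensions illustrative.

```python
import torch

def memory_read(query_key, mem_keys, mem_values):
    """Simplified STCN-style memory readout.

    query_key:  (B, Ck, H*W)    keys of the current frame
    mem_keys:   (B, Ck, T*H*W)  keys stored for T past frames
    mem_values: (B, Cv, T*H*W)  mask features stored for T past frames
    Returns:    (B, Cv, H*W)    value features retrieved for the query
    """
    # Negative squared L2 affinity between every query and memory location:
    # -(q - m)^2 = 2*q.m - q^2 - m^2
    q2 = (query_key ** 2).sum(dim=1, keepdim=True)          # (B, 1, HW)
    m2 = (mem_keys ** 2).sum(dim=1, keepdim=True)           # (B, 1, THW)
    qm = torch.einsum("bck,bcm->bkm", query_key, mem_keys)  # (B, HW, THW)
    affinity = 2 * qm - q2.transpose(1, 2) - m2
    weights = torch.softmax(affinity, dim=2)                # over memory
    # Each query location reads a convex combination of stored values.
    return torch.einsum("bkm,bcm->bck", weights, mem_values)

# Usage: 2 past frames in memory, 30x54 key resolution.
B, Ck, Cv, HW, THW = 1, 64, 512, 30 * 54, 2 * 30 * 54
out = memory_read(torch.randn(B, Ck, HW), torch.randn(B, Ck, THW),
                  torch.randn(B, Cv, THW))
print(out.shape)  # torch.Size([1, 512, 1620])
```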
Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition
In this paper we propose an end-to-end trainable deep neural network model
for egocentric activity recognition. Our model is built on the observation that
egocentric activities are highly characterized by the objects and their
locations in the video. Based on this, we develop a spatial attention mechanism
that enables the network to attend to regions containing objects that are
correlated with the activity under consideration. We learn highly specialized
attention maps for each frame using class-specific activations from a CNN
pre-trained for generic image recognition, and use them for spatio-temporal
encoding of the video with a convolutional LSTM. Our model is trained in a
weakly supervised setting using raw video-level activity-class labels.
Nonetheless, on standard egocentric activity benchmarks our model surpasses the currently best-performing method, which relies on strong supervision from hand segmentation and object locations for training, by up to 6 percentage points in recognition accuracy. We visually analyze the attention maps generated by the network, revealing that it successfully identifies the relevant objects present in the video frames, which may explain the strong recognition performance. We also present an extensive ablation analysis of our design choices.
Comment: Accepted to BMVC 201
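The abstract describes deriving attention maps from class-specific activations of a CNN pretrained for generic image recognition, which is the idea behind class activation maps (CAM). Below is a hedged PyTorch sketch of such an attention stage; the use of ResNet-18, the top-class selection, and the softmax normalization are assumptions for illustration, not the paper's exact design (the classifier bias is also omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class CAMAttention(nn.Module):
    """Illustrative object-centric spatial attention: class activation maps
    from an ImageNet-pretrained CNN are turned into a per-frame attention
    map that re-weights the features before temporal encoding."""

    def __init__(self):
        super().__init__()
        cnn = resnet18(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # conv feats
        self.fc_weights = cnn.fc.weight  # (1000, 512) classifier weights

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(frame)                    # (B, 512, h, w)
        # Pick the most activated class per image (bias omitted).
        logits = F.adaptive_avg_pool2d(feat, 1).flatten(1) @ self.fc_weights.t()
        cls = logits.argmax(dim=1)
        # CAM: weight conv features by that class's classifier weights.
        cam = torch.einsum("bchw,bc->bhw", feat, self.fc_weights[cls])
        attn = torch.softmax(cam.flatten(1), dim=1).view_as(cam).unsqueeze(1)
        return feat * attn                             # attended features

# Usage: attended features would then feed a convolutional LSTM.
x = torch.randn(2, 3, 224, 224)
print(CAMAttention()(x).shape)  # torch.Size([2, 512, 7, 7])
```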