Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability
Video segmentation encompasses a wide range of problem formulations, e.g., object, scene, actor-action and multimodal video segmentation, each delineating task-specific scene components with pixel-level masks. Recently, approaches in this research area have shifted from ConvNet-based to transformer-based models. In parallel, various interpretability approaches have appeared for transformer models and video temporal dynamics, motivated by the growing interest in basic scientific understanding, model diagnostics and the societal implications of real-world deployment. Previous surveys mainly focused either on ConvNet models for a subset of video segmentation tasks or on transformers for classification, and a component-wise discussion of transformer-based video segmentation models has not yet received due attention. Likewise, earlier reviews of interpretability methods concentrated on transformers for classification, while analysis of how video models capture temporal dynamics has received less attention. In this survey, we address the above with a thorough discussion of the various categories of video segmentation, a component-wise discussion of state-of-the-art transformer-based models, and a review of related interpretability methods. We first introduce the different video segmentation task categories, their objectives, specific challenges and benchmark datasets. Next, we provide a component-wise review of recent transformer-based models and document the state of the art on the different video segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc interpretability methods for transformer models, as well as interpretability methods for understanding the role of the temporal dimension in video models. Finally, we conclude with future research directions.
Towards Stable Co-saliency Detection and Object Co-segmentation
In this paper, we present a novel model for simultaneous stable co-saliency detection (CoSOD) and object co-segmentation (CoSEG). To detect co-saliency (co-segmentation) accurately, the core problem is to properly model the inter-image relations within an image group. Some methods design sophisticated modules, such as recurrent neural networks (RNNs), to address this problem. However, RNNs suffer from order sensitivity, which heavily affects the stability of the resulting CoSOD (CoSEG) model. Inspired by RNN-based models, we first propose a multi-path stable recurrent unit (MSRU), consisting of a dummy orders mechanism (DOM) and a recurrent unit (RU). The proposed MSRU not only helps the CoSOD (CoSEG) model capture robust inter-image relations, but also reduces order sensitivity, resulting in more stable training and inference. Moreover, we design a cross-order contrastive loss (COCL) that further addresses the order-sensitivity problem by pulling closer the feature embeddings generated from different input orders (a sketch follows the abstract). We validate our model on five widely used CoSOD datasets (CoCA, CoSOD3k, Cosal2015, iCoseg and MSRC) and on three widely used object co-segmentation datasets (Internet, iCoseg and PASCAL-VOC); the results demonstrate the superiority of the proposed approach over state-of-the-art (SOTA) methods.
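The abstract does not give the exact form of COCL, but its stated goal, pulling together the feature embeddings produced from different input orders of the same image group, maps naturally onto an InfoNCE-style objective. Below is a minimal sketch under that assumption; the function name, tensor shapes, and the use of other groups as negatives are illustrative choices, not the paper's verified formulation.

```python
import torch
import torch.nn.functional as F

def cross_order_contrastive_loss(emb_a: torch.Tensor,
                                 emb_b: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Pull together the embeddings of the same image group computed under
    two different input orders; push apart embeddings of different groups.

    emb_a, emb_b: (num_groups, dim) group embeddings, one tensor per order.
    (Hypothetical interface; the paper's COCL may be formulated differently.)
    """
    a = F.normalize(emb_a, dim=-1)                      # cosine-normalised features
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                    # pairwise group similarities
    targets = torch.arange(a.size(0), device=a.device)  # group i matches group i
    # Symmetric InfoNCE: order-A embeddings classify their order-B partners
    # and vice versa, so the loss does not privilege either input order.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Minimizing such a loss drives the group representation to be approximately invariant to the order in which images enter the recurrent unit, which is the stability property the paper targets.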
CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection
Most existing bi-modal (RGB-D and RGB-T) salient object detection methods rely on the convolution operation and construct complex interweaved fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation imposes a performance ceiling on convolution-based methods. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity of attention w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations (a sketch follows the abstract). Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when equipped with the proposed components.
Comment: Updated version, more flexible structure, better performance