Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation
We address the problem of referring image segmentation that aims to generate
a mask for the object specified by a natural language expression. Many recent
works utilize Transformer to extract features for the target object by
aggregating the attended visual regions. However, the generic attention
mechanism in Transformer only uses the language input for attention weight
calculation, which does not explicitly fuse language features in its output.
Thus, its output features are dominated by vision information, which limits the
model's ability to comprehensively understand the multi-modal information and
brings uncertainty to the subsequent mask decoder that extracts the output mask. To
address this issue, we propose a Multi-Modal Mutual Attention module and a
Multi-Modal Mutual Decoder that better fuse information from the two input
modalities. Building on this design, we further propose Iterative Multi-modal
Interaction to allow continuous and in-depth interaction between language and
vision features. Furthermore, we introduce Language Feature Reconstruction to
prevent the language information from being lost or distorted in the extracted
features.
Extensive experiments show that our proposed approach significantly improves
the baseline and consistently outperforms state-of-the-art referring image
segmentation methods on the RefCOCO series of datasets.
Comment: IEEE TI
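To make the fusion asymmetry concrete, the toy PyTorch module below sketches one way to realize mutual attention between the two modalities: vision attends to language, language attends to vision, and both attended outputs are fused, so language content appears explicitly in the output rather than only shaping the attention weights. The module name, dimensions, and the pooling-based fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MutualCrossAttention(nn.Module):
    """Toy mutual attention: each modality attends to the other, and both
    attended outputs are fused, so language content shows up explicitly in
    the output instead of only shaping the attention weights."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis:  (B, HW, C) flattened visual tokens; lang: (B, T, C) word tokens
        vis_att, _ = self.vis_to_lang(query=vis, key=lang, value=lang)   # language routed to visual positions
        lang_att, _ = self.lang_to_vis(query=lang, key=vis, value=vis)   # visual content routed to word tokens
        # broadcast a pooled language summary onto every visual position before fusing
        lang_summary = lang_att.mean(dim=1, keepdim=True).expand_as(vis_att)
        return self.fuse(torch.cat([vis_att, lang_summary], dim=-1))

fused = MutualCrossAttention()(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 196, 256])
```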
VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search
Text-based Person Search (TBPS) aims to retrieve images of target pedestrians
indicated by textual descriptions. It is essential for TBPS to extract
fine-grained local features and align them across modalities. Existing methods
utilize external tools or heavy cross-modal interaction to achieve explicit
alignment of cross-modal fine-grained features, which is inefficient and
time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network
(VGSG) for text-based person search to extract well-aligned fine-grained visual
and textual features. In the proposed VGSG, we develop a Semantic-Group Textual
Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to
extract textual local features under the guidance of visual local clues. In
SGTL, in order to obtain local textual representations, we group textual
features along the channel dimension based on the semantic cues of the language
expression, which encourages similar semantic patterns to be grouped implicitly
without external tools. In VGKT, vision-guided attention is employed to
extract vision-related textual features, which are inherently aligned with
visual cues and termed vision-guided textual features. Furthermore, we design a
relational knowledge transfer, including a vision-language similarity transfer
and a class probability transfer, to adaptively propagate information from the
vision-guided textual features to semantic-group textual features. With the
help of relational knowledge transfer, VGKT is capable of aligning
semantic-group textual features with corresponding visual features without
external tools and complex pairwise interaction. Experimental results on two
challenging benchmarks demonstrate its superiority over state-of-the-art
methods.
Comment: Accepted to IEEE TI
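As a rough illustration of the vision-guided attention in VGKT, the sketch below lets each local visual feature query the word tokens with scaled dot-product attention, producing textual features that follow the visual partition. The tensor shapes and the stripe-based notion of "local" visual features are assumptions made for the example, not the paper's exact module.

```python
import torch

def vision_guided_textual_features(vis_local: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """Each local visual feature queries the word tokens, yielding textual
    features aligned with the visual partition.
    vis_local:  (B, K, C) local visual features (e.g. horizontal stripes)
    txt_tokens: (B, T, C) word-level textual features
    returns:    (B, K, C) vision-guided textual features
    """
    scale = vis_local.shape[-1] ** 0.5
    attn = torch.softmax(vis_local @ txt_tokens.transpose(1, 2) / scale, dim=-1)  # (B, K, T)
    return attn @ txt_tokens

vg_txt = vision_guided_textual_features(torch.randn(2, 6, 256), torch.randn(2, 20, 256))
print(vg_txt.shape)  # torch.Size([2, 6, 256])
```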
Gradient-Semantic Compensation for Incremental Semantic Segmentation
Incremental semantic segmentation aims to continually learn the segmentation
of new coming classes without accessing the training data of previously learned
classes. However, most current methods fail to address catastrophic forgetting
and background shift since they 1) treat all previous classes equally without
considering different forgetting paces caused by imbalanced gradient
back-propagation; 2) lack strong semantic guidance between classes. To tackle
the above challenges, in this paper, we propose a Gradient-Semantic
Compensation (GSC) model, which surmounts incremental semantic segmentation
from both gradient and semantic perspectives. Specifically, to address
catastrophic forgetting from the gradient aspect, we develop a step-aware
gradient compensation that can balance forgetting paces of previously seen
classes via re-weighting gradient backpropagation. Meanwhile, we propose a
soft-sharp semantic relation distillation to distill consistent inter-class
semantic relations via soft labels for alleviating catastrophic forgetting from
the semantic aspect. In addition, we develop a prototypical pseudo re-labeling
that provides strong semantic guidance to mitigate background shift. It
produces high-quality pseudo labels for old classes in the background by
measuring distances between pixels and class-wise prototypes. Extensive
experiments on three public datasets, i.e., Pascal VOC 2012, ADE20K, and
Cityscapes, demonstrate the effectiveness of our proposed GSC model.
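A minimal way to picture the gradient-side compensation is class-wise re-weighting of the segmentation loss so that classes learned in earlier steps contribute larger gradients. The linear weighting rule below is a placeholder assumption for illustration, not the paper's step-aware formulation.

```python
import torch
import torch.nn.functional as F

def step_weighted_ce(logits, target, class_to_step, current_step, ignore_index=255):
    """Class-wise re-weighted cross-entropy: classes introduced in earlier
    incremental steps receive larger weights, roughly balancing their gradient
    magnitude against newly added classes.
    logits: (B, C, H, W); target: (B, H, W)
    class_to_step: (C,) tensor giving the step at which each class was introduced
    """
    # older classes (smaller step index) get larger weights; linear rule for illustration
    weights = 1.0 + (current_step - class_to_step).clamp(min=0).float()
    return F.cross_entropy(logits, target, weight=weights, ignore_index=ignore_index)

logits = torch.randn(2, 5, 8, 8)
target = torch.randint(0, 5, (2, 8, 8))
loss = step_weighted_ce(logits, target, torch.tensor([0, 0, 1, 1, 2]), current_step=2)
print(loss.item())
```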
Knowledge-aware Deep Framework for Collaborative Skin Lesion Segmentation and Melanoma Recognition
Deep learning techniques have shown superior performance in dermatological
clinical inspection. Nevertheless, melanoma diagnosis remains a challenging
task due to the difficulty of incorporating useful dermatologist clinical
knowledge into the learning process. In this paper, we
propose a novel knowledge-aware deep framework that incorporates some clinical
knowledge into collaborative learning of two important melanoma diagnosis
tasks, i.e., skin lesion segmentation and melanoma recognition. Specifically,
to exploit the knowledge of morphological expressions of the lesion region and
also the periphery region for melanoma identification, a lesion-based pooling
and shape extraction (LPSE) scheme is designed, which transfers the structure
information obtained from skin lesion segmentation into melanoma recognition.
Meanwhile, to pass the skin lesion diagnosis knowledge from melanoma
recognition to skin lesion segmentation, an effective diagnosis guided feature
fusion (DGFF) strategy is designed. Moreover, we propose a recursive mutual
learning mechanism that further promotes the inter-task cooperation, and thus
iteratively improves the joint learning capability of the model for both skin
lesion segmentation and melanoma recognition. Experimental results on two
publicly available skin lesion datasets show the effectiveness of the proposed
method for melanoma analysis.
Comment: Pattern Recognitio
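The lesion-based pooling idea can be pictured as mask-guided pooling over the shared features: average features inside the predicted lesion and in a dilated ring around it as a stand-in for the periphery, then hand both descriptors to the recognition branch. The dilation radius and the use of max-pooling for dilation are assumptions of this sketch, not the paper's LPSE scheme.

```python
import torch
import torch.nn.functional as F

def lesion_and_periphery_pooling(feat, mask, dilate=5):
    """Average features inside the predicted lesion and in a dilated ring
    around it, then concatenate both descriptors for the recognition branch.
    feat: (B, C, H, W) shared features; mask: (B, 1, H, W) lesion probability map
    """
    hard = (mask > 0.5).float()
    # morphological dilation via max-pooling, then subtract the lesion to get the ring
    ring = F.max_pool2d(hard, kernel_size=2 * dilate + 1, stride=1, padding=dilate) - hard
    eps = 1e-6
    lesion_feat = (feat * hard).sum(dim=(2, 3)) / (hard.sum(dim=(2, 3)) + eps)
    ring_feat = (feat * ring).sum(dim=(2, 3)) / (ring.sum(dim=(2, 3)) + eps)
    return torch.cat([lesion_feat, ring_feat], dim=1)  # (B, 2C) descriptor

desc = lesion_and_periphery_pooling(torch.randn(2, 64, 56, 56), torch.rand(2, 1, 56, 56))
print(desc.shape)  # torch.Size([2, 128])
```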
MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
This paper strives for motion expressions guided video segmentation, which
focuses on segmenting objects in video content based on a sentence describing
the motion of the objects. Existing referring video object datasets typically
focus on salient objects and use language expressions that contain excessive
static attributes that could potentially enable the target object to be
identified in a single frame. These datasets downplay the importance of motion
in video content for language-guided video object segmentation. To investigate
the feasibility of using motion expressions to ground and segment objects in
videos, we propose a large-scale dataset called MeViS, which contains numerous
motion expressions to indicate target objects in complex environments. We
benchmarked 5 existing referring video object segmentation (RVOS) methods and
conducted a comprehensive comparison on the MeViS dataset. The results show
that current RVOS methods cannot effectively address motion expression-guided
video segmentation. We further analyze the challenges and propose a baseline
approach for the proposed MeViS dataset. The goal of our benchmark is to
provide a platform that enables the development of effective language-guided
video segmentation algorithms that leverage motion expressions as a primary cue
for object segmentation in complex video scenes. The proposed MeViS dataset has
been released at https://henghuiding.github.io/MeViS.
Comment: ICCV 2023, Project Page: https://henghuiding.github.io/MeViS
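For readers who want to reproduce the benchmark comparison in spirit, the sketch below shows the shape of a per-frame region-overlap evaluation for any RVOS method. The `model.segment` call and the sample layout are hypothetical placeholders, and the official MeViS evaluation protocol may differ (for instance, it may also score boundary quality).

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region overlap (IoU) between two binary masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union > 0 else 1.0

def evaluate(samples, model):
    """samples: iterable of (frames, expression, gt_masks);
    model.segment is a placeholder for any RVOS method returning one mask per frame."""
    scores = []
    for frames, expression, gt_masks in samples:
        pred_masks = model.segment(frames, expression)
        scores.extend(mask_iou(p, g) for p, g in zip(pred_masks, gt_masks))
    return float(np.mean(scores))
```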
Learning-Based Biharmonic Augmentation for Point Cloud Classification
Point cloud datasets often suffer from inadequate sample sizes in comparison
to image datasets, making data augmentation challenging. Traditional methods,
such as rigid transformations and scaling, have limited potential for
increasing dataset diversity because they are constrained in how they can alter
individual sample shapes. We therefore introduce Biharmonic Augmentation (BA),
a novel and efficient data augmentation technique that diversifies point cloud
data by imposing smooth non-rigid deformations on existing 3D structures. This
approach calculates biharmonic coordinates for the deformation function and
learns diverse deformation prototypes. Utilizing a CoefNet, our method predicts
coefficients to amalgamate these prototypes, ensuring comprehensive
deformation. Moreover, we present AdvTune, an advanced online augmentation
system that integrates adversarial training. This system synergistically
refines the CoefNet and the classification network, facilitating the automated
creation of adaptive shape deformations contingent on the learner status.
Comprehensive experimental analysis validates the superiority of Biharmonic
Augmentation, showcasing notable performance improvements over prevailing point
cloud augmentation techniques across varied network designs.
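The prototype-blending idea can be sketched as follows: a small coefficient network (a stand-in for CoefNet) predicts mixing weights from the input cloud, and the weighted sum of learned per-point displacement fields deforms the shape. Representing the deformation fields as free parameters, rather than through biharmonic coordinates and control handles, is a simplification made purely for illustration.

```python
import torch
import torch.nn as nn

class PrototypeBlendAugment(nn.Module):
    """Predict mixing coefficients from the input cloud and blend learned
    per-point displacement fields into a smooth-ish non-rigid deformation."""

    def __init__(self, num_points: int = 1024, num_prototypes: int = 8):
        super().__init__()
        self.prototypes = nn.Parameter(0.01 * torch.randn(num_prototypes, num_points, 3))
        self.coef_net = nn.Sequential(                     # stand-in for CoefNet
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, num_prototypes))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3); a global summary of the cloud drives the coefficients
        coefs = torch.softmax(self.coef_net(points.mean(dim=1)), dim=-1)   # (B, P)
        offsets = torch.einsum('bp,pnc->bnc', coefs, self.prototypes)      # blended deformation field
        return points + offsets

augmented = PrototypeBlendAugment()(torch.randn(4, 1024, 3))
print(augmented.shape)  # torch.Size([4, 1024, 3])
```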
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) aims to connect image and language via
outputting the corresponding object masks given a text description, which is a
fundamental vision-language task. Although many works have achieved
considerable progress on RIS, in this work we explore an essential question:
"what if the text description is wrong or misleading?". We term such a sentence
a negative sentence and find that existing works cannot handle this setting. To
this end, we propose a novel formulation
of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the
negative sentence inputs besides the regularly given text inputs. We present
three different datasets via augmenting the input negative sentences and a new
metric to unify both input types. Furthermore, we design a new
transformer-based model named RefSegformer, where we introduce a token-based
vision and language fusion module. Such a module can be easily extended to our
R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves
the new state-of-the-art results on three regular RIS datasets and three R-RIS
datasets, which serves as a new solid baseline for further research. The
project page is at https://lxtgh.github.io/project/robust_ref_seg/.
Comment: technical repor
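A minimal sketch of the blank-token idea, assuming a single cross-attention fusion layer: learnable blank tokens are appended to the word tokens so attention can route to them when the sentence matches no object, and a small existence head turns the fused features into a sentence-level decision. The layer layout and the head are illustrative, not RefSegformer's exact architecture.

```python
import torch
import torch.nn as nn

class BlankTokenFusion(nn.Module):
    """Append learnable blank tokens to the word tokens before cross-attention,
    giving the model a place to attend when no referred object exists."""

    def __init__(self, dim: int = 256, heads: int = 8, num_blank: int = 4):
        super().__init__()
        self.blank = nn.Parameter(torch.zeros(1, num_blank, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.exists_head = nn.Linear(dim, 1)   # does the expression refer to anything?

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis: (B, HW, C) visual tokens; lang: (B, T, C) word tokens
        tokens = torch.cat([lang, self.blank.expand(vis.size(0), -1, -1)], dim=1)
        fused, _ = self.attn(query=vis, key=tokens, value=tokens)
        exists_logit = self.exists_head(fused.mean(dim=1))   # (B, 1) sentence-level score
        return fused, exists_logit

fused, exists = BlankTokenFusion()(torch.randn(2, 196, 256), torch.randn(2, 15, 256))
print(fused.shape, exists.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 1])
```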
Federated Incremental Semantic Segmentation
Federated learning-based semantic segmentation (FSS) has drawn widespread
attention via decentralized training on local clients. However, most FSS models
assume that categories are fixed in advance and thus suffer heavy forgetting of
old categories in practical applications where local clients receive new
categories incrementally while having no memory to access old classes.
Moreover, new clients collecting novel classes may join in the global training
of FSS, which further exacerbates catastrophic forgetting. To surmount the
above challenges, we propose a Forgetting-Balanced Learning (FBL) model to
address heterogeneous forgetting on old classes from both intra-client and
inter-client aspects. Specifically, under the guidance of pseudo labels
generated via adaptive class-balanced pseudo labeling, we develop a
forgetting-balanced semantic compensation loss and a forgetting-balanced
relation consistency loss to rectify intra-client heterogeneous forgetting of
old categories with background shift. It performs balanced gradient propagation
and relation consistency distillation within local clients. Moreover, to tackle
heterogeneous forgetting from inter-client aspect, we propose a task transition
monitor. It can identify new classes under privacy protection and store the
latest old global model for relation distillation. Qualitative experiments
reveal a large improvement of our model over comparison methods. The code is
available at https://github.com/JiahuaDong/FISS.
Comment: Accepted to CVPR202
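To picture the pseudo re-labeling that counters background shift, the sketch below relabels background pixels with the frozen old global model's prediction whenever its confidence clears a per-class threshold. The quantile-based threshold stands in for the paper's adaptive class-balanced rule and is an assumption of this example.

```python
import torch

def class_balanced_pseudo_labels(old_logits, new_target, bg_class=0, quantile=0.6):
    """Relabel background pixels with the old global model's prediction when its
    confidence exceeds a per-class threshold.
    old_logits: (B, C_old, H, W) from the frozen old global model
    new_target: (B, H, W) current-task ground truth, with old classes as background
    """
    conf, pred = old_logits.softmax(dim=1).max(dim=1)       # (B, H, W)
    pseudo = new_target.clone()
    bg = new_target == bg_class
    for c in pred[bg].unique():
        sel = bg & (pred == c)
        thr = conf[sel].quantile(quantile)                   # class-wise confidence threshold
        pseudo[sel & (conf >= thr)] = c
    return pseudo

pseudo = class_balanced_pseudo_labels(torch.randn(2, 6, 16, 16),
                                      torch.zeros(2, 16, 16, dtype=torch.long))
print(pseudo.unique())
```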