GFF: Gated Fully Fusion for Semantic Segmentation
Semantic segmentation generates a comprehensive understanding of scenes by
densely predicting the category of each pixel. High-level features from Deep
Convolutional Neural Networks already demonstrate their effectiveness in
semantic segmentation tasks; however, the coarse resolution of high-level
features often leads to inferior results for small/thin objects where detailed
information is important. It is natural to import low-level
features to compensate for the detailed information lost in high-level
features. Unfortunately, simply combining multi-level features suffers from the
semantic gap among them. In this paper, we propose a new architecture, named
Gated Fully Fusion (GFF), to selectively fuse features from multiple levels
using gates in a fully connected way. Specifically, features at each level are
enhanced by higher-level features with stronger semantics and lower-level
features with more details, and gates are used to control the propagation of
useful information, which significantly reduces noise during fusion. We
achieve state-of-the-art results on four challenging scene parsing datasets,
including Cityscapes, Pascal Context, COCO-stuff, and ADE20K.
Comment: accepted by AAAI-2020 (oral).
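To make the gating idea concrete, here is a minimal, hypothetical PyTorch sketch of one plausible gating scheme (a single-channel sigmoid gate per level, each level enhanced by gated contributions from all other levels). It is an illustration of the abstract's description, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFullyFusion(nn.Module):
    """Minimal sketch of gated fully connected feature fusion.

    Assumes all pyramid levels are already projected to a common channel
    width; the single-channel sigmoid gate per level is an assumption
    based on the abstract, not the official design.
    """

    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # One 1x1 conv per level produces a spatial gate in [0, 1].
        self.gate_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        # feats: list of (B, C, H_l, W_l) tensors, one per pyramid level.
        gates = [torch.sigmoid(conv(x)) for conv, x in zip(self.gate_convs, feats)]
        fused = []
        for l, (x_l, g_l) in enumerate(zip(feats, gates)):
            h, w = x_l.shape[-2:]
            # Gated contributions from every other level, resampled to the
            # current resolution before summation.
            others = sum(
                F.interpolate(g_i * x_i, size=(h, w), mode="bilinear",
                              align_corners=False)
                for i, (x_i, g_i) in enumerate(zip(feats, gates)) if i != l
            )
            # The level's own gate amplifies its features; its complement
            # controls how much cross-level information flows in.
            fused.append((1 + g_l) * x_l + (1 - g_l) * others)
        return fused
```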
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) aims to connect image and language by
outputting the corresponding object mask given a text description; it is a
fundamental vision-language task. Although many works have achieved
considerable progress on RIS, in this work we explore an essential question:
"what if the text description is wrong or misleading?" We
term such a sentence a negative sentence. We find that existing
works cannot handle such settings. To this end, we propose a novel formulation
of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the
negative sentence inputs besides the regularly given text inputs. We present
three datasets built by augmenting inputs with negative sentences, and a new
metric to unify both input types. Furthermore, we design a new
transformer-based model named RefSegformer, where we introduce a token-based
vision and language fusion module. This module can be easily extended to our
R-RIS setting by adding extra blank tokens. The proposed RefSegformer achieves
new state-of-the-art results on three regular RIS datasets and three R-RIS
datasets, serving as a solid baseline for further research. The
project page is at https://lxtgh.github.io/project/robust_ref_seg/.
Comment: technical report.
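As a rough illustration of how blank tokens might plug into a token-based fusion module, consider the hypothetical sketch below; the module and head names are invented, and the actual RefSegformer design may differ.

```python
import torch
import torch.nn as nn

class TokenFusionWithBlanks(nn.Module):
    """Hypothetical sketch of token-based vision-language fusion extended
    with learnable "blank" tokens for the R-RIS setting described above.
    """

    def __init__(self, dim: int, num_heads: int = 8, num_blank: int = 4):
        super().__init__()
        self.blank_tokens = nn.Parameter(torch.zeros(1, num_blank, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Predicts from the blank tokens whether the sentence matches nothing.
        self.negative_head = nn.Linear(dim, 1)

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, N_v, D) flattened image features
        # lang_tokens: (B, N_t, D) encoded text
        b, n_v = vis_tokens.shape[:2]
        blanks = self.blank_tokens.expand(b, -1, -1)
        queries = torch.cat([vis_tokens, blanks], dim=1)
        # Visual and blank tokens jointly attend over the language tokens;
        # a negative sentence can be absorbed by the blank tokens instead
        # of forcing a spurious match with image content.
        out, _ = self.attn(queries, lang_tokens, lang_tokens)
        fused, blank_out = out[:, :n_v], out[:, n_v:]
        neg_logit = self.negative_head(blank_out.mean(dim=1))  # (B, 1)
        return fused, neg_logit
```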
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
We present MosaicFusion, a simple yet effective diffusion-based data
augmentation approach for large vocabulary instance segmentation. Our method is
training-free and does not rely on any label supervision. Two key designs
enable us to employ an off-the-shelf text-to-image diffusion model as a useful
dataset generator for object instances and mask annotations. First, we divide
an image canvas into several regions and perform a single round of the
diffusion process to generate multiple instances simultaneously, conditioned on
different text prompts. Second, we obtain corresponding instance masks by
aggregating cross-attention maps associated with object prompts across layers
and diffusion time steps, followed by simple thresholding and edge-aware
refinement. Without bells and whistles, our MosaicFusion can produce
a significant amount of synthetic labeled data for both rare and novel
categories. Experimental results on the challenging LVIS long-tailed and
open-vocabulary benchmarks demonstrate that MosaicFusion can significantly
improve the performance of existing instance segmentation models, especially
for rare and novel categories. Code will be released at
https://github.com/Jiahao000/MosaicFusion.
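The second design, recovering masks from cross-attention, can be summarized in a few lines. The sketch below assumes the per-token attention maps have already been collected from the diffusion model; the paper's edge-aware refinement step is omitted.

```python
import torch
import torch.nn.functional as F

def mask_from_cross_attention(attn_maps, image_size, thresh=0.5):
    """Sketch of mask extraction from diffusion cross-attention: average
    the maps tied to one object prompt token over layers and denoising
    steps, then binarize with a simple global threshold.

    attn_maps:  list of (H_l, W_l) attention maps for one prompt token,
                collected from different layers / timesteps.
    image_size: (H, W) target resolution.
    """
    h, w = image_size
    acc = torch.zeros(h, w)
    for m in attn_maps:
        # Upsample every map to the target resolution before averaging.
        acc += F.interpolate(m[None, None], size=(h, w), mode="bilinear",
                             align_corners=False)[0, 0]
    acc /= len(attn_maps)
    # Normalize to [0, 1] and threshold.
    acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)
    return acc > thresh  # boolean instance mask
```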
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
In this work, we focus on open vocabulary instance segmentation to expand a
segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex
pipelines to establish one-to-one mappings between image regions and words in
captions. However, such methods introduce noisy supervision by matching
non-visual words, such as adjectives and verbs, to image regions. Meanwhile, context words
are also important for inferring the existence of novel objects as they show
high inter-correlations with novel categories. To overcome these limitations,
we devise a joint Caption Grounding and Generation (CGG) framework,
which incorporates a novel grounding loss that only focuses on matching object
nouns to improve learning efficiency. We also introduce a caption generation
head that enables additional supervision and contextual modeling as a
complement to the grounding loss. Our analysis and results demonstrate
that grounding and generation components complement each other, significantly
enhancing the segmentation performance for novel classes. Experiments on the
COCO dataset under two settings, Open Vocabulary Instance Segmentation (OVIS)
and Open Set Panoptic Segmentation (OSPS), demonstrate the superiority of
CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel
classes without extra data on the OVIS task and a 15% PQ improvement for novel
classes on the OSPS benchmark.
Comment: ICCV 2023.
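A noun-restricted grounding loss of the kind described above could plausibly take a contrastive form like the sketch below; the one-to-one region/noun pairing and InfoNCE formulation are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def noun_grounding_loss(region_embs, noun_embs, temperature=0.07):
    """Illustrative contrastive grounding loss restricted to object nouns.
    Assumes nouns were already extracted from the caption (e.g. with a POS
    tagger) and both sides live in a shared embedding space.

    region_embs: (N, D) embeddings of predicted object regions
    noun_embs:   (N, D) embeddings of their matched caption nouns
    """
    region_embs = F.normalize(region_embs, dim=-1)
    noun_embs = F.normalize(noun_embs, dim=-1)
    logits = region_embs @ noun_embs.t() / temperature  # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric InfoNCE: each region should match its own noun, and
    # each noun its own region.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```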
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
Visual Grounding (VG) aims at localizing target objects from an image based
on given expressions and has made significant progress with the development of
detection and vision transformers. However, existing VG methods tend to generate
false-alarm objects when presented with inaccurate or irrelevant descriptions,
which commonly occur in practical applications. Moreover, existing methods fail
to capture fine-grained features, achieve accurate localization, and gain sufficient context
comprehension from the whole image and textual descriptions. To address both
issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with
Masked Reference based Centerpoint Supervision (MRCS). The framework introduces
iterative multi-level vision-language fusion (IMVF) for better alignment. We
use MRCS to achieve more accurate localization with point-wise feature
supervision. Then, to improve the robustness of VG, we also present a
multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of
false-alarm objects when presented with inaccurate expressions. The proposed
framework is evaluated on five regular VG datasets and two newly constructed
robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new
state-of-the-art (SOTA) results, with improvements of 25% and 10% compared to
existing SOTA approaches on the two newly proposed robust VG datasets.
Moreover, the proposed framework is also shown to be effective on the five regular VG
datasets. Code and models will be made publicly available at
https://github.com/cv516Buaa/IR-VG
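One plausible reading of center-point supervision is to render a Gaussian target at the referred object's center and regress a predicted point map against it. The sketch below is a hypothetical illustration of that idea, not the MRCS module itself; all names are invented.

```python
import torch
import torch.nn.functional as F

def centerpoint_target(h, w, center, sigma=2.0):
    """Render a Gaussian heatmap peaked at the referred object's center."""
    ys = torch.arange(h).float().view(-1, 1)
    xs = torch.arange(w).float().view(1, -1)
    cy, cx = center
    return torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def centerpoint_loss(pred_heatmap, center, sigma=2.0):
    # pred_heatmap: (H, W) sigmoid output of a hypothetical grounding head.
    h, w = pred_heatmap.shape
    target = centerpoint_target(h, w, center, sigma).to(pred_heatmap)
    return F.mse_loss(pred_heatmap, target)
```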
PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation
Aerial Image Segmentation is a particular semantic segmentation problem and
has several challenging characteristics that general semantic segmentation does
not have. There are two critical issues: one is an extremely
imbalanced foreground-background distribution, and the other is multiple small
objects set against a complex background. Such problems make recent dense
affinity context modeling perform poorly, even compared with baselines, due to
over-introduced background context. To handle these problems, we propose a
point-wise affinity propagation module based on the Feature Pyramid Network
(FPN) framework, named PointFlow. Rather than dense affinity learning, a sparse
affinity map is generated over selected points between adjacent feature levels,
which reduces the noise introduced by the background while remaining efficient.
In particular, we design a dual point matcher to select points from the salient
area and object boundaries, respectively. Experimental results on three
different aerial segmentation datasets suggest that the proposed method is more
effective and efficient than state-of-the-art general semantic segmentation
methods. In particular, our method achieves the best speed-accuracy trade-off
on three aerial benchmarks. Further experiments on three general semantic
segmentation datasets prove the generality of our method. Code will be provided
at https://github.com/lxtGH/PFSegNets.
Comment: accepted by CVPR 2021.
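To make the sparse-affinity idea concrete, here is a hypothetical sketch that selects k points per level and computes a k x k affinity instead of a dense one. Selecting points by activation magnitude stands in for the paper's dual point matcher (salient-area plus boundary points).

```python
import torch

def sparse_point_affinity(low, high, k=64):
    """Sketch of sparse affinity propagation between adjacent FPN levels:
    pick k points per map, compute affinity only between those points, and
    flow high-level semantics onto the low-level points.

    low:  (B, C, H, W) finer level
    high: (B, C, h, w) coarser level
    """
    b, c = low.shape[:2]
    low_flat, high_flat = low.flatten(2), high.flatten(2)     # (B, C, HW)
    # Select the k most activated positions per map as candidate points.
    low_idx = low_flat.norm(dim=1).topk(k, dim=1).indices     # (B, k)
    high_idx = high_flat.norm(dim=1).topk(k, dim=1).indices   # (B, k)
    low_pts = torch.gather(low_flat, 2, low_idx[:, None].expand(b, c, k))
    high_pts = torch.gather(high_flat, 2, high_idx[:, None].expand(b, c, k))
    # k x k affinity instead of a dense (H*W) x (h*w) one.
    affinity = torch.softmax(
        torch.einsum("bck,bcm->bkm", low_pts, high_pts) / c ** 0.5, dim=-1)
    updated = low_pts + torch.einsum("bkm,bcm->bck", affinity, high_pts)
    # Scatter the updated point features back into the fine feature map.
    out = low_flat.clone()
    out.scatter_(2, low_idx[:, None].expand(b, c, k), updated)
    return out.view_as(low)
```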