100 research outputs found
GFF: Gated Fully Fusion for Semantic Segmentation
Semantic segmentation generates comprehensive understanding of scenes through
densely predicting the category for each pixel. High-level features from Deep
Convolutional Neural Networks already demonstrate their effectiveness in
semantic segmentation tasks, however the coarse resolution of high-level
features often leads to inferior results for small/thin objects where detailed
information is important. It is natural to consider importing low level
features to compensate for the lost detailed information in high-level
features.Unfortunately, simply combining multi-level features suffers from the
semantic gap among them. In this paper, we propose a new architecture, named
Gated Fully Fusion (GFF), to selectively fuse features from multiple levels
using gates in a fully connected way. Specifically, features at each level are
enhanced by higher-level features with stronger semantics and lower-level
features with more details, and gates are used to control the propagation of
useful information, which significantly reduces noise during fusion. We
achieve state-of-the-art results on four challenging scene parsing datasets:
Cityscapes, Pascal Context, COCO-Stuff, and ADE20K.
Comment: Accepted by AAAI-2020 (oral)
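To make the fusion idea concrete, below is a minimal, hypothetical sketch of gate-controlled fusion across feature levels in the spirit of the abstract: each level emits a sigmoid gate that decides how much of its own information to keep and how much to import from the other levels. The module name, the exact gating formula, and the assumption that all levels are already projected to a common channel count and resolution are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fully-connected fusion over multi-level features."""
    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # One 1x1 conv per level producing a single-channel gate map in [0, 1].
        self.gates = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):  # feats: list of [B, C, H, W] tensors, one per level
        gates = [torch.sigmoid(g(x)) for g, x in zip(self.gates, feats)]
        fused = []
        for l, (x_l, g_l) in enumerate(zip(feats, gates)):
            # Messages from every other level, each weighted by the sender's gate.
            msg = sum(g_i * x_i for i, (x_i, g_i) in enumerate(zip(feats, gates)) if i != l)
            # Keep own features where the gate is high, import others where it is low.
            fused.append((1 + g_l) * x_l + (1 - g_l) * msg)
        return fused

# Usage: fuse four feature levels already mapped to 256 channels and a shared size.
feats = [torch.randn(2, 256, 64, 64) for _ in range(4)]
fused = GatedFusion(channels=256, num_levels=4)(feats)
```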
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) aims to connect image and language via
outputting the corresponding object masks given a text description, which is a
fundamental vision-language task. Although many works have achieved
considerable progress on RIS, in this work we explore an essential question:
what if the text description is wrong or misleading? We term such a sentence a
negative sentence and find that existing works cannot handle this setting. To
this end, we propose a novel formulation of RIS, named Robust Referring Image
Segmentation (R-RIS), which considers negative sentence inputs in addition to
the regular text inputs. We present
three different datasets built by augmenting the inputs with negative
sentences, along with a new metric that unifies both input types. Furthermore,
we design a new transformer-based model named RefSegformer, in which we
introduce a token-based vision-language fusion module. This module can be
easily extended to our R-RIS setting by adding extra blank tokens. The proposed
RefSegformer achieves new state-of-the-art results on three regular RIS
datasets and three R-RIS datasets, serving as a solid new baseline for further
research. The project page is at
\url{https://lxtgh.github.io/project/robust_ref_seg/}.
Comment: Technical report
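A minimal sketch of the token-based fusion idea with extra blank tokens, as described above: language tokens are concatenated with learnable blank tokens, image features cross-attend to both, and a sentence that puts most of its attention mass on the blank tokens is treated as a candidate negative sentence. The module, tensor shapes, and the detection rule are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    def __init__(self, dim: int = 256, num_blank: int = 4, num_heads: int = 8):
        super().__init__()
        self.blank = nn.Parameter(torch.randn(num_blank, dim))  # learnable blank tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: [B, N, D] flattened image features; lang: [B, L, D] text token features
        B = vis.size(0)
        tokens = torch.cat([lang, self.blank.expand(B, -1, -1)], dim=1)  # [B, L+K, D]
        fused, attn_w = self.attn(query=vis, key=tokens, value=tokens)
        # If most attention mass falls on the blank tokens, the sentence likely
        # refers to nothing in the image (a negative sentence).
        blank_mass = attn_w[..., lang.size(1):].sum(-1).mean(-1)        # [B]
        return fused, blank_mass
```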
CPT-Interp: Continuous sPatial and Temporal Motion Modeling for 4D Medical Image Interpolation
Motion information from 4D medical imaging offers critical insights into
dynamic changes in patient anatomy for clinical assessments and radiotherapy
planning, thereby enhancing the capabilities of 3D image analysis. However,
inherent physical and technical constraints of imaging hardware often
necessitate a compromise between temporal resolution and image quality. Frame
interpolation emerges as a pivotal solution to this challenge. Previous methods
often suffer from discretization artifacts when estimating the intermediate
motion and executing the forward warping. In this study, we draw inspiration from fluid
mechanics to propose a novel approach for continuously modeling patient
anatomic motion using implicit neural representation. It ensures both spatial
and temporal continuity, effectively bridging Eulerian and Lagrangian
specifications together to naturally facilitate continuous frame interpolation.
Our experiments across multiple datasets underscore the method's superior
accuracy and speed. Furthermore, as a case-specific optimization
(training-free) approach, it circumvents the need for extensive datasets and
addresses model generalization issues.
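The Eulerian-Lagrangian bridge described above can be illustrated with a small sketch: an implicit neural representation maps a spatio-temporal coordinate to a velocity (Eulerian view), and integrating that field advects points along trajectories (Lagrangian view), giving frames at arbitrary intermediate times. The network size, coordinate convention, and the forward-Euler integrator are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """MLP mapping a spatio-temporal coordinate (x, y, z, t) to a 3-D velocity."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x, t):
        # x: [N, 3] positions, t: [N, 1] times
        return self.net(torch.cat([x, t], dim=-1))

def trace(field, x0, t0, t1, steps=16):
    """Integrate the Eulerian field with forward Euler to get Lagrangian trajectories."""
    x, dt = x0, (t1 - t0) / steps
    for k in range(steps):
        t = torch.full_like(x[:, :1], t0 + k * dt)
        x = x + dt * field(x, t)
    return x  # positions advected from time t0 to time t1
```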
RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
Real-time multi-person pose estimation presents significant challenges in
balancing speed and precision. While two-stage top-down methods slow down as
the number of people in the image increases, existing one-stage methods often
fail to simultaneously deliver high accuracy and real-time performance. This
paper introduces RTMO, a one-stage pose estimation framework that seamlessly
integrates coordinate classification by representing keypoints using dual 1-D
heatmaps within the YOLO architecture, achieving accuracy comparable to
top-down methods while maintaining high speed. We propose a dynamic coordinate
classifier and a tailored loss function for heatmap learning, specifically
designed to address the incompatibilities between coordinate classification and
dense prediction models. RTMO outperforms state-of-the-art one-stage pose
estimators, achieving 1.1% higher AP on COCO while operating about 9 times
faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on
COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and
accuracy. The code and models are available at
https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.
Comment: Accepted at CVPR 2024. Project page:
https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo
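The coordinate-classification idea with dual 1-D heatmaps can be illustrated by its decoding step: each keypoint's x and y are predicted as distributions over bins along the two axes and decoded by taking the expectation (a soft-argmax). The bin counts, image size, and the soft decoding rule are assumptions for this sketch rather than RTMO's exact implementation.

```python
import torch

def decode_dual_1d_heatmaps(logits_x, logits_y, img_w, img_h):
    # logits_x: [B, K, Wbins], logits_y: [B, K, Hbins] per-keypoint axis logits
    px = logits_x.softmax(dim=-1)
    py = logits_y.softmax(dim=-1)
    xs = torch.linspace(0, img_w - 1, logits_x.size(-1), device=logits_x.device)
    ys = torch.linspace(0, img_h - 1, logits_y.size(-1), device=logits_y.device)
    x = (px * xs).sum(-1)   # [B, K] expected x coordinate
    y = (py * ys).sum(-1)   # [B, K] expected y coordinate
    return torch.stack([x, y], dim=-1)  # [B, K, 2]

# Usage: 17 COCO keypoints, 128 bins per axis on a 640x640 input.
coords = decode_dual_1d_heatmaps(torch.randn(2, 17, 128), torch.randn(2, 17, 128), 640, 640)
```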
VG4D: Vision-Language Model Goes 4D Video Recognition
Understanding the real world through point cloud video is a crucial aspect of
robotics and autonomous driving systems. However, prevailing methods for 4D
point cloud recognition have limitations due to sensor resolution, which leads
to a lack of detailed information. Recent advances have shown that
Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can
learn fine-grained visual concepts that can be transferred to various
downstream tasks. However, effectively integrating VLM into the domain of 4D
point clouds remains an unresolved problem. In this work, we propose the
Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from
visual-text pre-trained models to a 4D point cloud network. Our approach
involves aligning the 4D encoder's representation with a VLM trained on
large-scale image-text pairs, thereby learning a shared visual-text space. By
transferring the knowledge of the VLM to the 4D encoder and combining the two,
our VG4D achieves improved recognition performance. To enhance the 4D encoder,
we modernize the classic dynamic point cloud backbone and propose an improved
version of PSTNet, im-PSTNet, which can efficiently model point cloud videos.
Experiments demonstrate that our method achieves state-of-the-art performance
for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120
dataset. Code is available at \url{https://github.com/Shark0-0/VG4D}.
Comment: ICRA 202
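A hypothetical sketch of the alignment idea described above: pull the 4D point cloud encoder's clip-level embedding toward the paired embedding from a frozen VLM with a symmetric contrastive (InfoNCE-style) loss. The loss form, names, and dimensions are assumptions for illustration, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pc_emb, vlm_emb, temperature=0.07):
    # pc_emb: [B, D] from the 4D encoder; vlm_emb: [B, D] from the frozen VLM (paired)
    pc = F.normalize(pc_emb, dim=-1)
    vl = F.normalize(vlm_emb, dim=-1)
    logits = pc @ vl.t() / temperature            # [B, B] pairwise similarities
    targets = torch.arange(pc.size(0), device=pc.device)
    # Symmetric cross-entropy: each 4D clip should match its own VLM embedding and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```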
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
CLIP and the Segment Anything Model (SAM) are remarkable vision foundation
models (VFMs). SAM excels in segmentation tasks across diverse domains, while
CLIP is renowned for its zero-shot recognition capabilities. This paper
presents an in-depth exploration of integrating these two models into a unified
framework. Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired
model designed for simultaneous interactive segmentation and recognition,
leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The
former adapts SAM's knowledge into CLIP via distillation and learnable
transformer adapters, while the latter transfers CLIP knowledge into SAM,
enhancing its recognition capabilities. Extensive experiments on various
datasets and detectors show the effectiveness of Open-Vocabulary SAM in both
segmentation and recognition tasks, significantly outperforming the naive
baselines of simply combining SAM and CLIP. Furthermore, aided by training on
image classification data, our method can segment and recognize
approximately 22,000 classes.
Comment: Project page: https://www.mmlab-ntu.com/project/ovsa
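A hedged sketch of the SAM2CLIP-style transfer described above: a lightweight adapter maps CLIP image features toward the frozen SAM encoder's feature map under a distillation loss. The adapter design here uses simple 1x1 convolutions rather than the paper's transformer adapters, and the MSE objective is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Sam2ClipAdapter(nn.Module):
    def __init__(self, clip_dim: int, sam_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(clip_dim, sam_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(sam_dim, sam_dim, kernel_size=1),
        )

    def forward(self, clip_feat):           # [B, Cclip, H, W] dense CLIP features
        return self.proj(clip_feat)         # [B, Csam, H, W] predicted SAM-like features

def distill_loss(pred_sam_feat, sam_feat):
    # Match the frozen SAM encoder's feature map (resized to the same grid if needed).
    sam_feat = F.interpolate(sam_feat, size=pred_sam_feat.shape[-2:], mode="bilinear",
                             align_corners=False)
    return F.mse_loss(pred_sam_feat, sam_feat)
```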
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Grounding-DINO is a state-of-the-art open-set detection model that tackles
multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase
Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness
has led to its widespread adoption as a mainstream architecture for various
downstream applications. However, despite its significance, the original
Grounding-DINO model lacks comprehensive public technical details due to the
unavailability of its training code. To bridge this gap, we present
MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline,
which is built with the MMDetection toolbox. It adopts abundant vision datasets
for pre-training and various detection and grounding datasets for fine-tuning.
We give a comprehensive analysis of each reported result and detailed settings
for reproduction. Extensive experiments on these benchmarks demonstrate that
our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We
release all our models to the research community; code and trained models are
available at
https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.
Comment: 10 pages, 6 figures
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
In this work, we focus on open vocabulary instance segmentation to expand a
segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex
pipelines to establish one-to-one mappings between image regions and words in
captions. However, such methods build noisy supervision by matching non-visual
words, such as adjectives and verbs, to image regions. Meanwhile, context words
are also important for inferring the existence of novel objects, as they show
high inter-correlations with novel categories. To overcome these limitations,
we devise a joint \textbf{Caption Grounding and Generation (CGG)} framework,
which incorporates a novel grounding loss that only focuses on matching object
nouns to improve learning efficiency. We also introduce a caption generation
head that provides additional supervision and contextual modeling as a
complement to the grounding loss. Our analysis and results demonstrate that the
grounding and generation components complement each other, significantly
enhancing segmentation performance for novel classes. Experiments on the COCO
dataset under two settings, Open Vocabulary Instance Segmentation (OVIS) and
Open Set Panoptic Segmentation (OSPS), demonstrate the superiority of CGG.
Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel
classes without extra data on the OVIS task and a 15% PQ improvement for novel
classes on the OSPS benchmark.
Comment: ICCV-202
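An illustrative sketch of a grounding loss restricted to object nouns, in the spirit of the CGG framework above: nouns extracted from the caption are embedded with a text encoder and matched against predicted object queries, so non-noun words never enter the loss. The matching rule (each noun softly assigned to its best-matching query) is a simplifying assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def noun_grounding_loss(query_emb, noun_emb, temperature=0.1):
    # query_emb: [Q, D] per-image object query embeddings
    # noun_emb:  [N, D] embeddings of the object nouns extracted from the caption
    q = F.normalize(query_emb, dim=-1)
    n = F.normalize(noun_emb, dim=-1)
    sim = n @ q.t() / temperature            # [N, Q] noun-to-query similarities
    # Soft-assign each noun to the queries and encourage its best match to be
    # confident: maximize the largest assignment probability per noun.
    p = sim.softmax(dim=-1)
    return -(p.max(dim=-1).values.clamp_min(1e-6)).log().mean()
```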