Open-Vocabulary Segmentation with Semantic-Assisted Calibration
This paper studies open-vocabulary segmentation (OVS) by calibrating the
in-vocabulary and domain-biased embedding space with the generalized contextual
prior of CLIP. As the core of open-vocabulary understanding, the alignment of
visual content with the semantics of unbounded text has become the bottleneck
of this field. To address this challenge, recent works propose to utilize CLIP
as an additional classifier and aggregate model predictions with CLIP
classification results. Despite their remarkable progress, the performance of OVS
methods in relevant scenarios is still unsatisfactory compared with their supervised
counterparts. We attribute this to in-vocabulary embeddings and
domain-biased CLIP predictions. To this end, we present a Semantic-assisted
CAlibration Network (SCAN). In SCAN, we incorporate the generalized semantic prior
of CLIP into proposal embeddings to avoid collapsing onto known categories.
In addition, a contextual shift strategy is applied to mitigate the lack of global
context and unnatural background noise. With the above designs, SCAN achieves
state-of-the-art performance on all popular open-vocabulary segmentation
benchmarks. Furthermore, we also address a problem with the existing evaluation
system, which ignores semantic duplication across categories, and propose a new
metric called Semantic-Guided IoU (SG-IoU).
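
A minimal sketch of the general idea of calibrating in-vocabulary predictions with a CLIP prior, assuming a geometric ensemble of softmax scores; the tensor names, blend weights, and temperature are illustrative assumptions rather than SCAN's actual design:

```python
import torch
import torch.nn.functional as F

def calibrate_logits(proposal_emb, clip_emb, text_emb, seen_mask,
                     alpha_seen=0.35, alpha_unseen=0.75, tau=0.07):
    """proposal_emb: (N, D) in-vocabulary proposal embeddings.
    clip_emb:  (N, D) CLIP visual embeddings of the same proposals.
    text_emb:  (C, D) CLIP text embeddings of the category names.
    seen_mask: (C,) bool, True for categories seen during training."""
    text_emb = F.normalize(text_emb, dim=-1)
    p_in = F.softmax(F.normalize(proposal_emb, dim=-1) @ text_emb.T / tau, dim=-1)
    p_clip = F.softmax(F.normalize(clip_emb, dim=-1) @ text_emb.T / tau, dim=-1)
    # Geometric ensemble: trust CLIP more on unseen categories and the
    # in-vocabulary head more on seen ones (weights are assumptions).
    alpha = torch.where(seen_mask, torch.tensor(alpha_seen), torch.tensor(alpha_unseen))
    fused = p_in ** (1.0 - alpha) * p_clip ** alpha
    return fused / fused.sum(dim=-1, keepdim=True)
```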
Language-free Compositional Action Generation via Decoupling Refinement
Composing simple elements into complex concepts is crucial yet challenging,
especially for 3D action generation. Existing methods largely rely on extensive
natural language annotations to discern composable latent semantics, a process
that is often costly and labor-intensive. In this study, we introduce a novel
framework to generate compositional actions without reliance on language
auxiliaries. Our approach consists of three main components: Action Coupling,
Conditional Action Generation, and Decoupling Refinement. Action Coupling
utilizes an energy model to extract the attention masks of each sub-action,
subsequently integrating two actions using these attentions to generate
pseudo-training examples. Then, we employ a conditional generative model, CVAE,
to learn a latent space and facilitate diverse generation. Finally, we
propose Decoupling Refinement, which leverages a self-supervised pre-trained
model, MAE, to ensure semantic consistency between the sub-actions and
compositional actions. This refinement process involves rendering generated 3D
actions into 2D space, decoupling these images into two sub-segments, using the
MAE model to restore the complete image from sub-segments, and constraining the
recovered images to match images rendered from raw sub-actions. Due to the lack
of existing datasets containing both sub-actions and compositional actions, we
created two new datasets, named HumanAct-C and UESTC-C, and present a
corresponding evaluation metric. Both qualitative and quantitative assessments
are conducted to demonstrate the efficacy of our approach.
Comment: preprint
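
A minimal PyTorch sketch of the conditional-VAE component described above; the pose and condition dimensions, hidden sizes, and loss weighting are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ActionCVAE(nn.Module):
    """Conditional VAE that generates a pose vector given a sub-action condition."""
    def __init__(self, pose_dim=72, cond_dim=2, latent_dim=64, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim + cond_dim, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, latent_dim), nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, pose_dim))

    def forward(self, pose, cond):
        h = self.enc(torch.cat([pose, cond], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.dec(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

def cvae_loss(recon, pose, mu, logvar, beta=1e-3):
    # ELBO-style objective: reconstruction term plus KL divergence to the prior.
    rec = ((recon - pose) ** 2).mean()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```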
BNV-Fusion: Dense 3D Reconstruction using Bi-level Neural Volume Fusion
Dense 3D reconstruction from a stream of depth images is the key to many
mixed reality and robotic applications. Although methods based on Truncated
Signed Distance Function (TSDF) Fusion have advanced the field over the years,
the TSDF volume representation struggles to balance robustness to noisy
measurements against preservation of fine detail. We
present Bi-level Neural Volume Fusion (BNV-Fusion), which leverages recent
advances in neural implicit representations and neural rendering for dense 3D
reconstruction. In order to incrementally integrate new depth maps into a
global neural implicit representation, we propose a novel bi-level fusion
strategy that considers both efficiency and reconstruction quality by design.
We evaluate the proposed method on multiple datasets quantitatively and
qualitatively, demonstrating a significant improvement over existing methods.
Comment: Accepted at CVPR 202
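
A loose sketch of incrementally fusing a per-frame feature volume into a global one by weighted running average; BNV-Fusion's actual bi-level scheme optimizes latent codes with neural rendering, so the shapes and averaging rule here are simplifying assumptions:

```python
import torch

def fuse_frame(global_feat, global_weight, local_feat, local_weight):
    """global_feat / local_feat: (C, X, Y, Z) latent volumes.
    global_weight / local_weight: (1, X, Y, Z) accumulated observation weights."""
    new_weight = global_weight + local_weight
    # Weighted running average: cells observed more often change less per frame.
    fused = (global_feat * global_weight + local_feat * local_weight) / new_weight.clamp(min=1e-6)
    return fused, new_weight
```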
Self-similarity-based super-resolution of photoacoustic angiography from hand-drawn doodles
Deep-learning-based super-resolution photoacoustic angiography (PAA) is a
powerful tool that restores blood vessel images from under-sampled images to
facilitate disease diagnosis. Nonetheless, due to the scarcity of training
samples, PAA super-resolution models often exhibit inadequate generalization
capabilities, particularly in the context of continuous monitoring tasks. To
address this challenge, we propose a novel approach that employs a
super-resolution PAA method trained with forged PAA images. We start by
generating realistic PAA images of human lips from hand-drawn curves using a
diffusion-based image generation model. Subsequently, we train a
self-similarity-based super-resolution model with these forged PAA images.
Experimental results show that our method outperforms the super-resolution
model trained with authentic PAA images in both original-domain and
cross-domain tests. Notably, our approach boosts the quality of
super-resolution reconstruction using the images forged by the deep learning
model, indicating that collaboration between deep learning models can
facilitate generalization despite a limited initial dataset. This approach shows
promising potential for exploring zero-shot learning neural networks for vision
tasks.
Comment: 12 pages, 6 figures, journal
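
A hedged sketch of the training setup implied above: forged high-resolution PAA images are downsampled into (low-res, high-res) pairs and a generic super-resolution network is fit with an L1 loss. The x4 scale, the loss choice, and the unnamed `model` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def make_pairs(forged_images, scale=4):
    """forged_images: (N, 1, H, W) diffusion-generated PAA images used as targets."""
    low = F.interpolate(forged_images, scale_factor=1.0 / scale,
                        mode="bilinear", align_corners=False)
    return low, forged_images

def train_step(model, optimizer, low, high):
    # One optimization step of any super-resolution network against forged targets.
    optimizer.zero_grad()
    loss = F.l1_loss(model(low), high)
    loss.backward()
    optimizer.step()
    return loss.item()
```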
Towards Accurate Data-free Quantization for Diffusion Models
In this paper, we propose an accurate data-free post-training quantization
framework of diffusion models (ADP-DM) for efficient image generation.
Conventional data-free quantization methods learn shared quantization functions
for tensor discretization regardless of the generation timestep, even though the
activation distribution differs significantly across timesteps. Moreover,
calibration images are acquired at random timesteps, which fails to provide
sufficient information for learning generalizable quantization functions. Both
issues cause sizable quantization errors and noticeable degradation of image
generation performance. In contrast, we design group-wise quantization
functions for activation discretization at different timesteps and sample the
optimal timestep for informative calibration image generation, so that our
quantized diffusion model can reduce the discretization errors with negligible
computational overhead. Specifically, we partition the timesteps according to
the importance weights of quantization functions in different groups, which are
optimized by differentiable search algorithms. We also select the optimal
timestep for calibration image generation by the structural risk minimization
principle, in order to enhance generalization when deploying the quantized
diffusion model. Extensive experimental results show that our method
outperforms state-of-the-art post-training quantization of diffusion models
by a sizable margin at similar computational cost.
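
A simplified sketch of timestep-group-wise activation quantization: timesteps are partitioned into groups and each group receives its own calibrated scale. The uniform grouping and symmetric min-max calibration below are stand-ins for the paper's differentiable search and structural-risk-based timestep selection:

```python
import torch

def assign_groups(num_timesteps=1000, num_groups=4):
    # Uniform partition of timesteps into groups (stand-in for the learned grouping).
    return torch.arange(num_timesteps) * num_groups // num_timesteps

def calibrate_scales(activations_per_group, num_bits=8):
    # One symmetric min-max scale per timestep group, from calibration activations.
    qmax = 2 ** (num_bits - 1) - 1
    return {g: acts.abs().max() / qmax for g, acts in activations_per_group.items()}

def quantize(activation, timestep, groups, scales, num_bits=8):
    # Quantize with the scale of the group this timestep belongs to.
    qmax = 2 ** (num_bits - 1) - 1
    s = scales[int(groups[timestep])]
    return torch.clamp(torch.round(activation / s), -qmax - 1, qmax) * s
```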
Universal Segmentation at Arbitrary Granularity with Language Instruction
This paper aims to achieve universal segmentation at arbitrary semantic
levels. Despite significant progress in recent years, specialist segmentation
approaches are limited to specific tasks and data distributions. Retraining a
new model to adapt to new scenarios or settings incurs expensive
computation and time costs, which raises the demand for a versatile, universal
segmentation model that can cater to various granularities. Although some
attempts have been made to unify different segmentation tasks or to
generalize to various scenarios, limitations in the definition of paradigms
and input-output spaces make it difficult for them to accurately
understand content at arbitrary granularity. To this end, we present
UniLSeg, a universal segmentation model that can perform segmentation at any
semantic level with the guidance of language instructions. For training
UniLSeg, we reorganize a group of tasks from their original diverse distributions
into a unified data format, where images paired with texts describing the
segmentation targets serve as input and the corresponding masks are the output.
Combined with an automatic annotation engine that exploits numerous unlabeled
data, UniLSeg achieves excellent performance across various tasks and settings,
surpassing both specialist and unified segmentation models.
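
A small illustrative sketch of normalizing heterogeneous segmentation annotations into a single (image, instruction text, mask) record; the task names, field names, and instruction templates are assumptions, not UniLSeg's released format:

```python
def to_unified_record(sample, task):
    """Map one annotated sample from a source task into the unified format."""
    if task == "semantic":
        text = f"segment all {sample['class_name']}"
    elif task == "referring":
        text = sample["referring_expression"]
    elif task == "salient":
        text = "segment the most salient object"
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": sample["image"], "text": text, "mask": sample["mask"]}
```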
DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery
The recovery of occluded human meshes presents challenges for current methods
due to the difficulty in extracting effective image features under severe
occlusion. In this paper, we introduce DPMesh, an innovative framework for
occluded human mesh recovery that capitalizes on the profound diffusion prior
about object structure and spatial relationships embedded in a pre-trained
text-to-image diffusion model. Unlike previous methods reliant on conventional
backbones for vanilla feature extraction, DPMesh seamlessly integrates the
pre-trained denoising U-Net with potent knowledge as its image backbone and
performs a single-step inference to provide occlusion-aware information. To
enhance the perception capability for occluded poses, DPMesh incorporates
well-designed guidance via condition injection, which produces effective
controls from 2D observations for the denoising U-Net. Furthermore, we explore
a dedicated noisy key-point reasoning approach to mitigate disturbances arising
from occlusion and crowded scenarios. This strategy fully unleashes the
perceptual capability of the diffusion prior, thereby enhancing accuracy.
Extensive experiments affirm the efficacy of our framework, as we outperform
state-of-the-art methods on both occlusion-specific and standard datasets. The
persuasive results underscore its ability to achieve precise and robust 3D
human mesh recovery, particularly in challenging scenarios involving occlusion
and crowded scenes.
Comment: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 202
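
A hedged sketch of using a pre-trained denoising U-Net as a single-step feature extractor via forward hooks, assuming a Stable-Diffusion-style U-Net call signature; the block names, fixed timestep, and surrounding pipeline (latent encoding, condition embedding) are assumptions rather than DPMesh's implementation:

```python
import torch

def extract_diffusion_features(unet, latent, cond_embedding, block_names, timestep=50):
    """Run one denoising step and capture the outputs of the named U-Net blocks."""
    feats = {}

    def save(name):
        return lambda _module, _inputs, output: feats.__setitem__(name, output)

    hooks = [module.register_forward_hook(save(name))
             for name, module in unet.named_modules() if name in block_names]
    with torch.no_grad():
        t = torch.tensor([timestep], device=latent.device)
        unet(latent, t, encoder_hidden_states=cond_embedding)  # single-step inference
    for h in hooks:
        h.remove()
    return feats  # multi-scale features for a downstream mesh-regression head
```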
OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression
This paper presents a language-powered paradigm for ordinal regression.
Existing methods usually treat each rank as a category and employ a set of
weights to learn these concepts. These methods are prone to overfitting and usually
attain unsatisfactory performance as the learned concepts are mainly derived
from the training set. Recent large pre-trained vision-language models like
CLIP have shown impressive performance on various visual tasks. In this paper,
we propose to learn the rank concepts from the rich semantic CLIP latent space.
Specifically, we reformulate this task as an image-language matching problem
with a contrastive objective, which regards labels as text and obtains a
language prototype from a text encoder for each rank. Since prompt engineering
for CLIP is extremely time-consuming, we propose OrdinalCLIP, a differentiable
prompting method for adapting CLIP for ordinal regression. OrdinalCLIP consists
of learnable context tokens and learnable rank embeddings. The learnable rank
embeddings are constructed by explicitly modeling numerical continuity,
resulting in well-ordered, compact language prototypes in the CLIP space. Once
learned, we can save only the language prototypes and discard the huge language
model, resulting in zero additional computational overhead compared with the
linear head counterpart. Experimental results show that our paradigm achieves
competitive performance in general ordinal regression tasks, and gains
improvements in few-shot and distribution shift settings for age estimation.
The code is available at https://github.com/xk-huang/OrdinalCLIP.
Comment: Accepted by NeurIPS 2022. Code is available at https://github.com/xk-huang/OrdinalCLIP
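
A minimal sketch of the rank-prototype idea: interpolate a few base embeddings so that neighboring ranks stay close in CLIP space, then predict a rank as the softmax-weighted expectation over image-prototype similarities. The linear interpolation, temperature, and expectation decoding are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def build_rank_embeddings(base_embeds, num_ranks):
    """base_embeds: (K, D) learnable anchors, K >= 2; returns (num_ranks, D)
    via linear interpolation to enforce numerical continuity across ranks."""
    k = base_embeds.shape[0]
    pos = torch.linspace(0, k - 1, num_ranks)
    lo = pos.floor().long().clamp(max=k - 2)
    w = (pos - lo.float()).unsqueeze(-1)
    return (1 - w) * base_embeds[lo] + w * base_embeds[lo + 1]

def predict_rank(image_feat, rank_protos, tau=0.07):
    """image_feat: (D,) image embedding; rank_protos: (R, D) language prototypes."""
    sims = F.normalize(rank_protos, dim=-1) @ F.normalize(image_feat, dim=-1)
    probs = F.softmax(sims / tau, dim=-1)
    # Decode the rank as the expectation over the rank distribution.
    return (probs * torch.arange(len(probs), dtype=probs.dtype)).sum()
```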
1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation
Recent transformer-based models have dominated the Referring Video Object
Segmentation (RVOS) task due to their superior performance. Most prior works
adopt a unified DETR framework to generate segmentation masks in a
query-to-instance manner. In this work, we integrate the strengths of leading
RVOS models to build an effective paradigm. We first obtain binary mask
sequences from the RVOS models. To improve the consistency and quality of
masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally
ensembles RVOS models according to their framework design and training strategy,
and leverages different video object segmentation (VOS) models to enhance mask
coherence via an object propagation mechanism. Our method achieves 75.7% J&F on
the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in
Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
Code is available at https://github.com/RobertLuo1/iccv2023_RVOS_Challenge
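
An illustrative sketch of fusing binary mask sequences from several models by per-pixel soft voting; the equal weights and 0.5 threshold are assumptions, not the exact challenge recipe:

```python
import torch

def fuse_masks(mask_sequences, weights=None, threshold=0.5):
    """mask_sequences: list of (T, H, W) float tensors in [0, 1], one per model."""
    stacked = torch.stack(mask_sequences)                     # (M, T, H, W)
    if weights is None:
        weights = torch.ones(len(mask_sequences)) / len(mask_sequences)
    fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # weighted soft vote
    return (fused > threshold).float()                        # final binary masks
```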