CLIPood: Generalizing CLIP to Out-of-Distributions
Out-of-distribution (OOD) generalization, where the model needs to handle
distribution shifts from training, is a major challenge of machine learning.
Contrastive language-image pre-training (CLIP) models have shown impressive
zero-shot ability, but the further adaptation of CLIP on downstream tasks
undesirably degrades OOD performance. This paper aims to generalize CLIP to
out-of-distribution test data on downstream tasks. We propose CLIPood, a
fine-tuning method that can adapt CLIP models to OOD situations where both
domain shifts and open classes may occur on the unseen test data. To exploit
the semantic relations between classes from the text modality, CLIPood
introduces a new training objective, margin metric softmax (MMS), with class
adaptive margins for fine-tuning. To incorporate both the pre-trained zero-shot
model and the fine-tuned task-adaptive model, CLIPood leverages a new optimization
strategy, Beta moving average (BMA), to maintain a temporal ensemble weighted
by Beta distribution. Experiments on diverse datasets with different OOD
scenarios show that CLIPood consistently outperforms existing generalization
techniques.
Comment: Accepted by ICML 202
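The Beta moving average (BMA) described above can be sketched as a temporal ensemble of fine-tuning checkpoints weighted by a Beta density over (normalized) training time. The toy version below averages stored parameter snapshots after the fact; the `alpha`/`beta` values and the midpoint evaluation of the density are illustrative assumptions, not the paper's exact online update rule:

```python
import math

def beta_pdf(t, alpha, beta):
    """Beta(alpha, beta) probability density at t in (0, 1)."""
    norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return t ** (alpha - 1) * (1 - t) ** (beta - 1) / norm

def beta_moving_average(snapshots, alpha=1.0, beta=0.5):
    """Temporal ensemble of parameter snapshots weighted by a Beta density.

    `snapshots` is a list of flat parameter vectors (lists of floats) taken at
    evenly spaced fine-tuning steps; snapshot 0 would be the zero-shot model.
    Hyperparameters `alpha`/`beta` are placeholders, not the paper's settings.
    """
    n = len(snapshots)
    # Evaluate the density at the midpoint of each step interval to avoid the
    # endpoints, where the Beta density can diverge.
    ts = [(i + 0.5) / n for i in range(n)]
    ws = [beta_pdf(t, alpha, beta) for t in ts]
    z = sum(ws)
    ws = [w / z for w in ws]
    dim = len(snapshots[0])
    return [sum(w * s[j] for w, s in zip(ws, snapshots)) for j in range(dim)]
```

With `alpha = beta = 1` the Beta density is uniform and the ensemble reduces to a plain checkpoint average; skewing the density toward `t = 0` keeps the result closer to the zero-shot weights.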
Exploring scale invariance in the expansion of a spherical unitary Fermi gas
A unitary Fermi gas in an isotropic harmonic trap is predicted to exhibit
scale and conformal symmetry, with important consequences for its
thermodynamic and dynamical properties. By experimentally realizing an
isotropic harmonic
trap, we study the expansion of a unitary Fermi gas and demonstrate its
universal expansion dynamics along different directions and at different
temperatures. We show that, as a consequence of the SO(2,1) symmetry, the
measured release energy equals the trapping energy. In addition, away from
resonance, where scale invariance is broken, we determine the effective
exponent relating the chemical potential to the average density along the
BEC-BCS crossover, which agrees qualitatively with mean-field predictions.
This work opens the possibility of studying non-equilibrium dynamics in a
conformally invariant system in the future.
Comment: 15 pages and 8 figures
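For reference, the "effective exponent" mentioned above is conventionally the polytropic index in the local relation between chemical potential and density (the notation here follows a common convention and may differ from the paper's):

```latex
\mu \propto \bar{n}^{\gamma},
\qquad
\gamma \equiv \frac{\partial \ln \mu}{\partial \ln \bar{n}} .
```

At unitarity, scale invariance fixes $\mu = \xi E_F \propto \bar{n}^{2/3}$, so $\gamma = 2/3$, while $\gamma \to 1$ in the BEC limit of weakly interacting molecules.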
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
3D dense captioning requires a model to translate its understanding of an
input 3D scene into several captions associated with different object regions.
Existing methods adopt a sophisticated "detect-then-describe" pipeline, which
builds explicit relation modules upon a 3D detector with numerous hand-crafted
components. While these methods have achieved initial success, the cascade
pipeline tends to accumulate errors owing to duplicated, inaccurate box
estimates and cluttered 3D scenes. In this paper, we first propose Vote2Cap-DETR,
a simple-yet-effective transformer framework that decouples the decoding
process of caption generation and object localization through parallel
decoding. Moreover, we argue that object localization and description
generation require different levels of scene understanding, which could be
challenging for a shared set of queries to capture. To this end, we propose an
advanced version, Vote2Cap-DETR++, which decouples the queries into
localization and caption queries to capture task-specific features.
Additionally, we introduce an iterative spatial refinement strategy for vote
queries that yields faster convergence and better localization performance. We also
feed additional spatial information into the caption head for more accurate
descriptions. Without bells and whistles, extensive experiments on two commonly
used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR and
Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large
margin. Code will be made available at
https://github.com/ch3cook-fdu/Vote2Cap-DETR
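The decoupled-query idea can be sketched as follows; this is a hypothetical minimal PyTorch module illustrating the parallel-decoding structure, not the authors' implementation (see their repository for the real code). Two disjoint learned query sets pass through one shared decoder in parallel and then feed task-specific heads; all sizes, head shapes, and the single-step caption stub are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoupledQueryDecoder(nn.Module):
    """Sketch: separate localization and caption queries, decoded in parallel.

    Hypothetical module in the spirit of Vote2Cap-DETR++: a shared transformer
    decoder attends to encoded scene features, but disjoint query sets feed
    task-specific heads, so each set can capture task-specific features.
    """

    def __init__(self, d_model=256, n_queries=256, n_classes=18, vocab=3000):
        super().__init__()
        self.loc_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cap_queries = nn.Parameter(torch.randn(n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(d_model, 6)      # box center + size
        self.cls_head = nn.Linear(d_model, n_classes)
        self.cap_head = nn.Linear(d_model, vocab)  # word logits (stub, one step)

    def forward(self, scene_feats):
        # scene_feats: (B, N_tokens, d_model) encoded 3D scene tokens
        b = scene_feats.size(0)
        q = torch.cat([self.loc_queries, self.cap_queries], dim=0)
        q = q.unsqueeze(0).expand(b, -1, -1)
        # Both query sets are decoded in a single parallel pass (no cascade).
        h = self.decoder(q, scene_feats)
        n = self.loc_queries.size(0)
        loc, cap = h[:, :n], h[:, n:]
        return self.box_head(loc), self.cls_head(loc), self.cap_head(cap)
```

The point of the structure is that localization and captioning never gate each other: there is no detect-then-describe cascade for box errors to propagate through, and each query set is free to specialize.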