SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector
In this paper, we attempt to specialize the VLM model for OWOD tasks by
distilling its open-world knowledge into a language-agnostic detector.
Surprisingly, we observe that the combination of a simple knowledge
distillation approach and the automatic pseudo-labeling mechanism in OWOD can
achieve better performance for unknown object detection, even with a small
amount of data. Unfortunately, knowledge distillation for unknown objects
severely affects the learning of detectors with conventional structures for
known objects, leading to catastrophic forgetting. To alleviate these problems,
we propose the down-weight loss function for knowledge distillation
from vision-language to single vision modality. Meanwhile, we propose the
cascade decouple decoding structure that decouples the learning of
localization and recognition to reduce the impact of category interactions of
known and unknown objects on the localization learning process. Ablation
experiments demonstrate that both of them are effective in mitigating the
impact of open-world knowledge distillation on the learning of known objects.
Additionally, to alleviate the current lack of comprehensive benchmarks for
evaluating the ability of the open-world detector to detect unknown objects in
the open world, we propose two benchmarks, which we name
"StandardSet" and "IntensiveSet"
respectively, based on the complexity of their testing scenarios. Comprehensive
experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate
the effectiveness of our methods. The code and proposed dataset are available
at https://github.com/xiaomabufei/SKDF.
Comment: arXiv admin note: substantial text overlap with arXiv:2303.1162
A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a
practical yet challenging problem in which an agent learns to navigate
following a path described by language instructions without requiring any
path-instruction annotation data. Normally, the instructions have complex
grammatical structures and often contain various action descriptions (e.g.,
"proceed beyond", "depart from"). How to correctly understand and execute these
action demands is a critical problem, and the absence of annotated data makes
it even more challenging. Note that a well-educated human being can easily
understand path instructions without the need for any special training. In this
paper, we propose an action-aware zero-shot VLN method (A²Nav) by exploiting
the vision-and-language ability of foundation models. Specifically, the
proposed method consists of an instruction parser and an action-aware
navigation policy. The instruction parser utilizes the advanced reasoning
ability of large language models (e.g., GPT-3) to decompose complex navigation
instructions into a sequence of action-specific object navigation sub-tasks.
Each sub-task requires the agent to localize the object and navigate to a
specific goal position according to the associated action demand. To accomplish
these sub-tasks, an action-aware navigation policy is learned from freely
collected action-specific datasets that reveal distinct characteristics of each
action demand. We use the learned navigation policy for executing sub-tasks
sequentially to follow the navigation instruction. Extensive experiments show
A²Nav achieves promising ZS-VLN performance and even surpasses the
supervised learning methods on R2R-Habitat and RxR-Habitat datasets.
BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation
We present a large-scale video subtitle translation dataset, BigVideo, to
facilitate the study of multi-modality machine translation. Compared with the
widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger,
consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also
introduce two deliberately designed test sets to verify the necessity of visual
information: Ambiguous with the presence of ambiguous words, and Unambiguous in
which the text context is self-contained for translation. To better model the
common semantics shared across texts and videos, we introduce a contrastive
learning method in the cross-modal encoder. Extensive experiments on the
BigVideo show that: a) Visual information consistently improves the NMT model
in terms of BLEU, BLEURT, and COMET on both Ambiguous and Unambiguous test
sets. b) Visual information helps disambiguation, compared to the strong text
baseline on terminology-targeted scores and human evaluation. Dataset and our
implementations are available at https://github.com/DeepLearnXMU/BigVideo-VMT.
Comment: Accepted to ACL 2023 Findings
Structural pathways for ultrafast melting of optically excited thin polycrystalline Palladium films
Due to its extremely short timescale, the non-equilibrium melting of metals
is exceptionally difficult to probe experimentally. The knowledge of melting
mechanisms is thus based mainly on the results of theoretical predictions. This
work reports on the investigation of ultrafast melting of thin polycrystalline
Pd films studied by optical laser pump - X-ray free-electron laser probe
experiments and molecular-dynamics simulations. By acquiring X-ray diffraction
snapshots with sub-picosecond resolution, we capture the sample's atomic
structure during its transition from the crystalline to the liquid state.
Bridging the timescales of experiments and simulations allows us to formulate a
realistic microscopic picture of melting. We demonstrate that the existing
models of strongly non-equilibrium melting, developed for systems with
relatively weak electron-phonon coupling, remain valid even for ultrafast
heating rates achieved in femtosecond laser-excited Pd. Furthermore, we
highlight the role of pre-existing and transiently generated crystal defects in
the transition to the liquid state.
Comment: main manuscript 33 pages, 9 figures; supplemental material 19 pages,
13 figures - all in one file
Structural pathways for ultrafast melting of optically excited thin polycrystalline Palladium films
Due to its extremely short timescale, the non-equilibrium melting of metals is exceptionally difficult to probe experimentally. The knowledge of melting mechanisms is thus based mainly on the results of theoretical predictions. This work reports on the investigation of ultrafast melting of thin polycrystalline Pd films studied by optical laser pump – X-ray free-electron laser probe experiments and molecular-dynamics simulations. By acquiring X-ray diffraction snapshots with sub-picosecond resolution, we capture the sample's atomic structure during its transition from the crystalline to the liquid state. Bridging the timescales of experiments and simulations allows us to formulate a realistic microscopic picture of the crystal-liquid transition. According to the experimental data, the melting process gradually accelerates with the increasing density of deposited energy. The molecular dynamics simulations reveal that the transition mechanism progressively varies from heterogeneous, initiated inside the material at structurally disordered grain boundaries, to homogeneous, proceeding catastrophically in the crystal volume on a picosecond timescale comparable to that of electron-phonon coupling. We demonstrate that the existing models of strongly non-equilibrium melting, developed for systems with relatively weak electron-phonon coupling, remain valid even for ultrafast heating rates achieved in femtosecond laser-excited Pd. Furthermore, we highlight the role of pre-existing and transiently generated crystal defects in the transition to the liquid state.
The Speed of Sound in Methane under Conditions of the Thermal Boundary Layer of Uranus
We present the first direct observations of acoustic waves in warm dense
matter. We analyze wavenumber- and energy-resolved X-ray spectra taken from
warm dense methane created by laser-heating a cryogenic liquid jet. X-ray
diffraction and inelastic free electron scattering yield sample conditions of
0.3 ± 0.1 eV and 0.8 ± 0.1 g/cm³, corresponding to a pressure of
13 GPa and matching the conditions predicted in the thermal boundary
layer between the inner and outer envelope of Uranus. Inelastic X-ray
scattering was used to observe the collective oscillations of the ions. With a
highly improved energy resolution of 50 meV, we could clearly distinguish
the Brillouin peaks from the quasi-elastic Rayleigh feature. Data at different
wavenumbers were used to obtain a sound speed of 5.9 ± 0.5 km/s, which
enabled us to validate the use of Birch's law in this new parameter regime.
Comment: 7 pages, 4 figures with supplementary information