
    SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

    In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the combination of a simple knowledge distillation approach and the automatic pseudo-labeling mechanism in OWOD can achieve better performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures for known objects, leading to catastrophic forgetting. To alleviate these problems, we propose the down-weight loss function for knowledge distillation from the vision-language modality to the single vision modality. Meanwhile, we propose the cascade decoupled decoding structure, which decouples the learning of localization and recognition to reduce the impact of category interactions between known and unknown objects on the localization learning process. Ablation experiments demonstrate that both are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of open-world detectors to detect unknown objects in the open world, we propose two benchmarks, which we name "StandardSet♡" and "IntensiveSet♠" respectively, based on the complexity of their testing scenarios. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods. The code and proposed dataset are available at https://github.com/xiaomabufei/SKDF.
    Comment: arXiv admin note: substantial text overlap with arXiv:2303.1162
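
    The abstract leaves the exact form of the down-weight loss unspecified; as a rough illustration only, the sketch below shows one way a distillation term between VLM (teacher) region embeddings and detector (student) embeddings could be down-weighted by pseudo-label confidence. The function name, the cosine-distance form, and the objectness/alpha parameters are assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a down-weighted distillation loss for
# unknown-object proposals; weighting scheme and names are assumptions,
# not the SKDF implementation.
import torch
import torch.nn.functional as F

def down_weighted_distill_loss(student_embed, teacher_embed, objectness, alpha=0.5):
    """Distill VLM (teacher) region embeddings into the detector (student).

    student_embed: (N, D) region features from the detector head
    teacher_embed: (N, D) matching region features from the frozen VLM
    objectness:    (N,) pseudo-label confidence for each unknown proposal
    alpha:         exponent that down-weights low-confidence proposals
    """
    s = F.normalize(student_embed, dim=-1)
    t = F.normalize(teacher_embed, dim=-1)
    per_region = 1.0 - (s * t).sum(dim=-1)                  # cosine-distance distillation term
    weights = objectness.clamp(min=0.0, max=1.0) ** alpha   # down-weight uncertain regions
    return (weights * per_region).sum() / weights.sum().clamp(min=1e-6)
```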

    A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

    We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data. Normally, the instructions have complex grammatical structures and often contain various action descriptions (e.g., "proceed beyond", "depart from"). How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Note that a well-educated human being can easily understand path instructions without the need for any special training. In this paper, we propose an action-aware zero-shot VLN method (A²Nav) by exploiting the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal distinct characteristics of each action demand. We use the learned navigation policy to execute sub-tasks sequentially to follow the navigation instruction. Extensive experiments show that A²Nav achieves promising ZS-VLN performance and even surpasses supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.
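
    As a hedged illustration of the instruction-parser idea, the sketch below decomposes an instruction into action-specific object-navigation sub-tasks via an LLM prompt. The prompt wording, the JSON schema, and the call_llm() stand-in are assumptions; the paper's actual parser and action vocabulary may differ.

```python
# Illustrative sketch of LLM-based instruction parsing into action-specific
# object-navigation sub-tasks; prompt, schema, and call_llm() are assumptions,
# not the A²Nav implementation.
import json
from typing import Dict, List

PARSE_PROMPT = """Decompose the navigation instruction into an ordered list of
sub-tasks, each a JSON object with fields "action" (e.g. "approach", "pass_by",
"leave") and "object" (the landmark to navigate with respect to).
Instruction: "{instruction}"
Return only a JSON list."""

def call_llm(prompt: str) -> str:
    """Stand-in for a large language model (e.g. GPT-3) completion call."""
    raise NotImplementedError("plug in your LLM client here")

def parse_instruction(instruction: str) -> List[Dict[str, str]]:
    raw = call_llm(PARSE_PROMPT.format(instruction=instruction))
    subtasks = json.loads(raw)
    # Each sub-task is then handed to the action-aware navigation policy,
    # which localizes the object and moves to the goal pose the action implies.
    return subtasks

# e.g. "proceed beyond the sofa, then depart from the kitchen" might become:
# [{"action": "pass_by", "object": "sofa"}, {"action": "leave", "object": "kitchen"}]
```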

    BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation

    We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multi-modality machine translation. Compared with the widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger, consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous, with the presence of ambiguous words, and Unambiguous, in which the text context is self-contained for translation. To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder. Extensive experiments on BigVideo show that: a) visual information consistently improves the NMT model in terms of BLEU, BLEURT, and COMET on both the Ambiguous and Unambiguous test sets; b) visual information helps disambiguation, compared to the strong text baseline, on terminology-targeted scores and in human evaluation. The dataset and our implementations are available at https://github.com/DeepLearnXMU/BigVideo-VMT.
    Comment: Accepted to ACL 2023 Findings
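
    For readers unfamiliar with the contrastive component, here is a minimal sketch of a symmetric cross-modal InfoNCE-style objective over paired text and video encodings; the function names, pooling, and temperature value are assumptions rather than the BigVideo implementation.

```python
# Minimal sketch of a cross-modal contrastive (InfoNCE-style) objective over
# paired text and video encodings; names and the symmetric formulation are
# assumptions, not the BigVideo implementation.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_feats, video_feats, temperature=0.07):
    """text_feats, video_feats: (B, D) pooled encoder outputs for B aligned pairs."""
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    logits = t @ v.T / temperature                 # (B, B) pairwise similarities
    targets = torch.arange(t.size(0), device=t.device)
    # Matched pairs sit on the diagonal; pull them together, push the rest apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```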

    Structural pathways for ultrafast melting of optically excited thin polycrystalline Palladium films

    Due to its extremely short timescale, the non-equilibrium melting of metals is exceptionally difficult to probe experimentally. The knowledge of melting mechanisms is thus based mainly on the results of theoretical predictions. This work reports on the investigation of ultrafast melting of thin polycrystalline Pd films studied by optical laser pump – X-ray free-electron laser probe experiments and molecular-dynamics simulations. By acquiring X-ray diffraction snapshots with sub-picosecond resolution, we capture the sample's atomic structure during its transition from the crystalline to the liquid state. Bridging the timescales of experiments and simulations allows us to formulate a realistic microscopic picture of the crystal-liquid transition. According to the experimental data, the melting process gradually accelerates with increasing density of deposited energy. The molecular-dynamics simulations reveal that the transition mechanism progressively varies from heterogeneous, initiated inside the material at structurally disordered grain boundaries, to homogeneous, proceeding catastrophically in the crystal volume on a picosecond timescale comparable to that of electron-phonon coupling. We demonstrate that the existing models of strongly non-equilibrium melting, developed for systems with relatively weak electron-phonon coupling, remain valid even for the ultrafast heating rates achieved in femtosecond laser-excited Pd. Furthermore, we highlight the role of pre-existing and transiently generated crystal defects in the transition to the liquid state.
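
    As a small illustration of how an MD trajectory can be connected to the diffraction observable, the sketch below computes the static structure factor at a chosen Bragg vector and could be used to track the loss of long-range order during melting. It is not the paper's analysis code; the array names, lattice constant, and q-vector choice are only indicative.

```python
# Illustrative only: tracking the decay of a Bragg reflection from MD snapshots
# via the static structure factor S(q) = |sum_j exp(i q.r_j)|^2 / N.
import numpy as np

def bragg_intensity(positions, q_vec):
    """positions: (N, 3) atomic coordinates of one snapshot; q_vec: (3,) in rad/Angstrom."""
    phases = positions @ q_vec                     # q . r_j for every atom
    amp = np.exp(1j * phases).sum()
    return (np.abs(amp) ** 2) / positions.shape[0]

# For fcc Pd (a ~ 3.89 Angstrom), the (111) reflection sits at |q| = 2*pi*sqrt(3)/a.
a = 3.89
q_111 = (2 * np.pi / a) * np.array([1.0, 1.0, 1.0])

# Given an iterable of (N, 3) snapshots, the peak intensity drops toward the
# liquid background as long-range order is lost:
# intensities = [bragg_intensity(frame, q_111) for frame in trajectory]
```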

    The Speed of Sound in Methane under Conditions of the Thermal Boundary Layer of Uranus

    We present the first direct observations of acoustic waves in warm dense matter. We analyze wavenumber- and energy-resolved X-ray spectra taken from warm dense methane created by laser-heating a cryogenic liquid jet. X-ray diffraction and inelastic free electron scattering yield sample conditions of 0.3±0.1 eV and 0.8±0.1 g/cm³, corresponding to a pressure of ∼13 GPa and matching the conditions predicted in the thermal boundary layer between the inner and outer envelope of Uranus. Inelastic X-ray scattering was used to observe the collective oscillations of the ions. With a highly improved energy resolution of ∼50 meV, we could clearly distinguish the Brillouin peaks from the quasi-elastic Rayleigh feature. Data at different wavenumbers were used to obtain a sound speed of 5.9±0.5 km/s, which enabled us to validate the use of Birch's law in this new parameter regime.
    Comment: 7 pages, 4 figures with supplementary information
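
    To make the dispersion argument concrete, here is a hypothetical worked example that recovers a sound speed from Brillouin peak positions via the acoustic dispersion E = ħ·c_s·q. The (q, E) values are invented for illustration and chosen only to land near the reported 5.9 km/s; they are not the measured data.

```python
# Hypothetical worked example: sound speed from Brillouin peak positions at
# several wavenumbers via E = hbar * c_s * q. Illustrative numbers only.
import numpy as np

HBAR = 1.054571817e-34          # J*s
MEV_TO_J = 1.602176634e-22      # 1 meV in J
INV_ANGSTROM_TO_INV_M = 1e10

# (wavenumber in 1/Angstrom, Brillouin peak energy in meV) -- made-up values
q_inv_A = np.array([0.6, 0.8, 1.0])
E_meV = np.array([23.0, 31.0, 39.0])

q = q_inv_A * INV_ANGSTROM_TO_INV_M              # rad/m
omega = E_meV * MEV_TO_J / HBAR                  # rad/s

# Linear fit through the origin: omega = c_s * q
c_s = (q @ omega) / (q @ q)
print(f"sound speed ~ {c_s / 1e3:.1f} km/s")     # ~5.9 km/s for these numbers
```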