309 research outputs found

    A Fully Convolutional Tri-branch Network (FCTN) for Domain Adaptation

    A domain adaptation method for urban scene segmentation is proposed in this work. We develop a fully convolutional tri-branch network, where two branches assign pseudo labels to images in the unlabeled target domain while the third branch is trained with supervision based on images in the pseudo-labeled target domain. The re-labeling and re-training processes alternate. With this design, the tri-branch network learns target-specific discriminative representations progressively and, as a result, the cross-domain capability of the segmenter improves. We evaluate the proposed network on large-scale domain adaptation experiments using both synthetic (GTA) and real (Cityscapes) images. Our solution achieves state-of-the-art performance and outperforms previous methods by a significant margin. Comment: Accepted by ICASSP 201
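
The re-labeling step in the abstract above can be illustrated with a minimal sketch. The abstract does not state how the two labeling branches are combined, so the agreement rule, the confidence `threshold`, and the function name `agree_pseudo_labels` below are assumptions made for illustration, not the authors' method. Each inner list stands for one pixel's class probabilities from one branch.

```python
def agree_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """Assign a pseudo label per pixel, or None where no label is assigned.

    Assumed rule (not taken from the paper): keep a label only when both
    branches predict the same class and both are at least `threshold` confident.
    """
    labels = []
    for pa, pb in zip(probs_a, probs_b):
        class_a = max(range(len(pa)), key=pa.__getitem__)  # argmax of branch A
        class_b = max(range(len(pb)), key=pb.__getitem__)  # argmax of branch B
        confident = pa[class_a] >= threshold and pb[class_b] >= threshold
        labels.append(class_a if class_a == class_b and confident else None)
    return labels


# Toy per-pixel class probabilities from the two labeling branches.
probs_a = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
probs_b = [[0.97, 0.03], [0.30, 0.70], [0.05, 0.95]]
pseudo = agree_pseudo_labels(probs_a, probs_b)
print(pseudo)
```

In an alternating scheme, the third branch would then be trained only on pixels whose pseudo label is not `None`, after which the labeling step is repeated.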

    Experimental Results of Underwater Sound Speed Profile Inversion by Few-shot Multi-task Learning

    The underwater Sound Speed Profile (SSP) distribution has a great influence on the propagation mode of acoustic signals, so fast and accurate estimation of the SSP is of great importance in building underwater observation systems. State-of-the-art SSP inversion methods include frameworks based on matched field processing (MFP), compressive sensing (CS), and feedforward neural networks (FNN), among which the FNN shows better real-time performance while maintaining the same level of accuracy. However, training an FNN requires a large number of historical SSP samples, which are difficult to obtain in many ocean areas. This situation is called few-shot learning. To tackle this issue, we propose a multi-task learning (MTL) model with partial parameter sharing among different training tasks. With MTL, common features can be extracted, which accelerates the learning process on given tasks and reduces the demand for reference samples, thereby enhancing the generalization ability in few-shot learning. To verify the feasibility and effectiveness of MTL, a deep-ocean experiment was held in April 2023 in the South China Sea. Results show that MTL outperforms the state-of-the-art methods in terms of accuracy for SSP inversion, while inheriting the real-time advantage of the FNN during the inversion stage.
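
Partial parameter sharing, as described in the abstract above, can be sketched as a trunk whose weights are common to all tasks plus one small head per task. This is a toy forward pass only; the class names, layer sizes, weights, and the task names `region_a` and `region_b` are hypothetical and do not come from the paper.

```python
import math


class SharedTrunk:
    """Layer whose weights are shared by every task."""

    def __init__(self, weights):
        self.weights = weights  # one row of input weights per hidden unit

    def forward(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in self.weights]


class TaskHead:
    """Linear output layer owned by a single task."""

    def __init__(self, weights):
        self.weights = weights  # one row of hidden weights per output

    def forward(self, h):
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.weights]


def predict(trunk, head, x):
    # Every task reuses the trunk features, then applies its own head.
    return head.forward(trunk.forward(x))


trunk = SharedTrunk([[0.5, -0.2], [0.1, 0.3]])
heads = {
    "region_a": TaskHead([[1.0, 0.0]]),  # hypothetical task names
    "region_b": TaskHead([[0.0, 1.0]]),
}
x = [1.0, 2.0]  # toy input features
out_a = predict(trunk, heads["region_a"], x)
out_b = predict(trunk, heads["region_b"], x)
print(out_a, out_b)
```

Because the trunk is shared, samples from every task would update the same trunk weights during training, which is why a task with few samples can benefit from the others.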

    Jahn-Teller distortion driven ferromagnetism in a perovskite fluoride monolayer

    The Jahn-Teller distortion and the resulting orbital order usually cause some fascinating correlated electronic behaviors, and generally lead to antiferromagnetism in perovskite bulks. Here we demonstrate that the Jahn-Teller distortion present in the perovskite fluoride KCrF$_3$ bulk can be retained to the two-dimensional limit, resulting in a staggered orbital order and ferromagnetism in the perovskite monolayer. Octahedral tilt and rotation distortion also appear in the ground-state structure of the perovskite monolayer, which have minor effects on the electronic and magnetic properties with respect to the Jahn-Teller distortion. In addition, in the prototype phase without structural distortion, the partial occupation of the $e_g$ orbitals leads to a ferromagnetic metallic state. This work facilitates the design of two-dimensional ferromagnets and functional properties based on Jahn-Teller distortion and orbital order. Comment: 8 pages, 5 figures, 1 table

    Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

    Video Question Answering (VideoQA) has advanced significantly with the scaling of recent Large Language Models (LLMs). The key idea is to convert the visual information into the language feature space so that the capacity of LLMs can be fully exploited. Existing VideoQA methods typically follow one of two paradigms: (1) learning cross-modal alignment, or (2) using an off-the-shelf captioning model to describe the visual data. However, the first design needs costly training on large amounts of extra multi-modal data, whilst the second suffers from limited domain generalization. To address these limitations, a simple yet effective Retrieving-to-Answer (R2A) framework is proposed. Given an input video, R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model (e.g., CLIP). With both the question and the retrieved texts, an LLM (e.g., DeBERTa) can be directly used to yield a desired answer. Without the need for cross-modal fine-tuning, R2A allows all the key components (e.g., LLM, retrieval model, and text corpus) to be plug-and-play. Extensive experiments on several VideoQA benchmarks show that, despite having only 1.3B parameters and no fine-tuning, our R2A can outperform the 61 times larger Flamingo-80B model, even though the latter is additionally trained on nearly 2.1B multi-modal data.
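
The retrieve-then-answer flow in the abstract above can be sketched with toy data. The three-dimensional vectors below are hand-made stand-ins for embeddings that a pre-trained multi-modal model would produce, and the function names, corpus texts, and prompt layout are hypothetical; the abstract does not specify how the question and retrieved texts are passed to the LLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(video_vec, corpus, k=2):
    """Return the k corpus texts whose embeddings are closest to the video."""
    ranked = sorted(corpus, key=lambda item: cosine(video_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, texts):
    """Combine retrieved texts and the question (assumed prompt layout)."""
    context = " ".join(texts)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


# Toy corpus: (text, embedding) pairs with made-up embeddings.
corpus = [
    ("a dog catches a frisbee", [0.9, 0.1, 0.0]),
    ("a man cooks pasta", [0.0, 0.2, 0.9]),
    ("a puppy runs on grass", [0.8, 0.3, 0.1]),
]
video_vec = [1.0, 0.2, 0.0]  # made-up embedding of the input video

top = retrieve(video_vec, corpus, k=2)
prompt = build_prompt("What animal is in the video?", top)
print(prompt)
```

The resulting prompt would then be given to a frozen language model. Since no component is fine-tuned, the corpus, the retrieval model, or the language model could each be swapped independently.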