A Fully Convolutional Tri-branch Network (FCTN) for Domain Adaptation
A domain adaptation method for urban scene segmentation is proposed in this
work. We develop a fully convolutional tri-branch network, where two branches
assign pseudo labels to images in the unlabeled target domain while the third
branch is trained with supervision based on images in the pseudo-labeled target
domain. The re-labeling and re-training processes alternate. With this design,
the tri-branch network learns target-specific discriminative representations
progressively and, as a result, the cross-domain capability of the segmenter
improves. We evaluate the proposed network on large-scale domain adaptation
experiments using both synthetic (GTA) and real (Cityscapes) images. It is
shown that our solution achieves state-of-the-art performance and outperforms
previous methods by a significant margin.
Comment: Accepted by ICASSP 201
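The alternating re-labeling/re-training loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rule that a pixel receives a pseudo label only when the two labeling branches agree with high confidence, and the threshold value, are assumptions introduced for the sketch.

```python
import numpy as np

def assign_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """Combine the two labeling branches' softmax outputs into a
    pseudo-label map for the unlabeled target domain.

    probs_a, probs_b: (H, W, C) per-pixel class probabilities.
    Returns an (H, W) map; pixels where the branches disagree or are
    not confident enough are marked -1 (ignored during re-training).
    """
    labels_a = probs_a.argmax(axis=-1)
    labels_b = probs_b.argmax(axis=-1)
    # Confidence of the joint decision: the weaker branch's max probability.
    conf = np.minimum(probs_a.max(axis=-1), probs_b.max(axis=-1))
    agree = (labels_a == labels_b) & (conf >= threshold)
    return np.where(agree, labels_a, -1)
```

The third branch would then be trained on the pixels not marked -1, after which all branches are updated and the labeling step repeats.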
Experimental Results of Underwater Sound Speed Profile Inversion by Few-shot Multi-task Learning
Underwater Sound Speed Profile (SSP) distribution has great influence on the
propagation mode of acoustic signal, thus the fast and accurate estimation of
SSP is of great importance in building underwater observation systems. The
state-of-the-art SSP inversion methods include frameworks of matched field
processing (MFP), compressive sensing (CS), and feedforward neural networks
(FNN), among which FNN shows better real-time performance while maintaining
the same level of accuracy. However, training an FNN requires a large number
of historical SSP samples, which is difficult to satisfy in many ocean areas.
This situation is called few-shot learning. To tackle this issue, we propose a
multi-task learning (MTL) model with partial parameter sharing among different
training tasks. Through MTL, common features can be extracted across tasks,
accelerating the learning process on a given task and reducing the demand for
reference samples, thereby enhancing generalization ability in few-shot learning. To
verify the feasibility and effectiveness of MTL, a deep-ocean experiment was
conducted in April 2023 in the South China Sea. Results show that MTL
outperforms the state-of-the-art methods in terms of accuracy for SSP
inversion, while inheriting the real-time advantage of FNN during the
inversion stage.
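The partial-parameter-sharing idea can be sketched as a shared feature trunk with one output head per task. This is a hypothetical minimal model, not the paper's architecture: the layer sizes, the tanh activation, and the task names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedTrunkMTL:
    """Sketch of MTL with partial parameter sharing: one hidden layer
    whose weights are shared by all SSP-inversion tasks, plus a
    task-specific output head per task."""

    def __init__(self, n_in, n_hidden, n_out, task_names):
        # Shared parameters: learn features common to all tasks.
        self.W_shared = rng.normal(scale=0.1, size=(n_in, n_hidden))
        # Task-specific parameters: one inversion head per ocean area/task.
        self.heads = {t: rng.normal(scale=0.1, size=(n_hidden, n_out))
                      for t in task_names}

    def forward(self, x, task):
        h = np.tanh(x @ self.W_shared)   # shared feature extraction
        return h @ self.heads[task]      # task-specific SSP estimate
```

During training, gradients from every task would update `W_shared`, while each head sees only its own task's few samples, which is what reduces the per-task data demand.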
Jahn-Teller distortion driven ferromagnetism in a perovskite fluoride monolayer
The Jahn-Teller distortion and the resulting orbital order usually cause some
fascinating correlated electronic behaviors, and generally lead to
antiferromagnetism in perovskite bulks. Here we demonstrate that the
Jahn-Teller distortion present in the perovskite fluoride KCrF3 bulk can be
retained to the two-dimensional limit, resulting in a staggered orbital order
and ferromagnetism in the perovskite monolayer. Octahedral tilt and rotation
distortion also appear in the ground-state structure of the perovskite
monolayer, which have minor effects on the electronic and magnetic properties
with respect to the Jahn-Teller distortion. In addition, in the prototype phase
without structural distortion, the partial occupation of the orbitals
leads to a ferromagnetic metallic state. This work facilitates the design of
two-dimensional ferromagnets and functional properties based on Jahn-Teller
distortion and orbital order.
Comment: 8 pages, 5 figures, 1 table
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Video Question Answering (VideoQA) has been significantly advanced from the
scaling of recent Large Language Models (LLMs). The key idea is to convert the
visual information into the language feature space so that the capacity of LLMs
can be fully exploited. Existing VideoQA methods typically take two paradigms:
(1) learning cross-modal alignment, and (2) using an off-the-shelf captioning
model to describe the visual data. However, the first design requires costly
training on large amounts of extra multi-modal data, whilst the second suffers
from limited domain generalization. To address these limitations, a simple yet
effective Retrieving-to-Answer (R2A) framework is proposed. Given an input
video, R2A first retrieves a set of semantically similar texts from a generic
text corpus using a pre-trained multi-modal model (e.g., CLIP). With both the
question and the retrieved texts, an LLM (e.g., DeBERTa) can be directly used to
yield a desired answer. Without the need for cross-modal fine-tuning, R2A
allows all the key components (e.g., the LLM, retrieval model, and text corpus)
to be plug-and-play. Extensive experiments on several VideoQA benchmarks show
that, despite having only 1.3B parameters and no fine-tuning, our R2A can
outperform the 61-times-larger Flamingo-80B model, which was additionally
trained on nearly 2.1B multi-modal data.
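The retrieval step at the heart of R2A can be sketched as a nearest-neighbor search in a joint embedding space. This is an assumption-laden illustration: the cosine-similarity retrieval shown here stands in for the CLIP-based retrieval the abstract describes, and the corpus, embeddings, and `k` are placeholders.

```python
import numpy as np

def retrieve_texts(video_emb, corpus_embs, corpus_texts, k=3):
    """Return the k corpus texts most similar to a video embedding.

    video_emb:   (D,) embedding of the input video (e.g., from CLIP).
    corpus_embs: (N, D) pre-computed embeddings of the text corpus.
    corpus_texts: list of N strings aligned with corpus_embs.
    """
    v = video_emb / np.linalg.norm(video_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ v                     # cosine similarity to every corpus text
    top = np.argsort(-sims)[:k]      # indices of the k best matches
    return [corpus_texts[i] for i in top]
```

The retrieved texts, concatenated with the question, would then form the prompt given to the frozen language model, so no component needs cross-modal fine-tuning.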
Simulation of Solidified Microstructure and Experimental Comparative Study of Twin-Roll Casting Aluminum Alloys