Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning
In sequence-to-sequence learning, the decoder relies on the attention
mechanism to efficiently extract information from the encoder. While it is
common practice to draw information from only the last encoder layer, recent
work has proposed to use representations from different encoder layers for
diversified levels of information. Nonetheless, the decoder still obtains only
a single view of the source sequences, which might lead to insufficient
training of the encoder layer stack due to the hierarchy bypassing problem. In
this work, we propose layer-wise cross-view decoding, where for each decoder
layer, together with the representations from the last encoder layer, which
serve as a global view, those from other encoder layers are supplemented for a
stereoscopic view of the source sequences. Systematic experiments show that we
successfully address the hierarchy bypassing problem and substantially improve
the performance of sequence-to-sequence learning with deep representations on
diverse tasks.
Comment: 9 pages, 6 figures
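The idea of giving each decoder layer two views of the source can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pairing of decoder layer i with encoder layer i, and the averaging of the two contexts are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    # Scaled dot-product attention of one query vector over memory vectors.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]

def cross_view_context(query, encoder_layers, decoder_layer_idx):
    # Global view: the last encoder layer, used by every decoder layer.
    global_view = attend(query, encoder_layers[-1])
    # Second view: ASSUMED here to be the encoder layer with the same index.
    local_idx = min(decoder_layer_idx, len(encoder_layers) - 1)
    local_view = attend(query, encoder_layers[local_idx])
    # ASSUMED fusion: a plain average of the two context vectors.
    return [(g + l) / 2 for g, l in zip(global_view, local_view)]

encoder_layers = [
    [[1.0, 0.0], [0.0, 1.0]],  # encoder layer 0, two source positions
    [[0.5, 0.5], [1.0, 1.0]],  # encoder layer 1 (last layer, global view)
]
context = cross_view_context([1.0, 0.0], encoder_layers, decoder_layer_idx=0)
```

Because every decoder layer still reads the last encoder layer, the lower encoder layers supplement the top of the stack rather than replacing it.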
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Training a supervised video captioning model requires coupled video-caption
pairs. However, for many target languages, sufficient paired data are not
available. To this end, we introduce the unpaired video captioning task, which
aims to train models without coupled video-caption pairs in the target language.
To solve the task, a natural choice is a two-step pipeline system: first
use a video-to-pivot captioning model to generate captions in a pivot
language, then use a pivot-to-target translation model to translate the
pivot captions into the target language. However, in such a pipeline system, 1)
visual information cannot reach the translation model, which yields visually
irrelevant target captions; 2) errors in the generated pivot captions are
propagated to the translation model, resulting in disfluent target captions.
To address these problems, we propose the Unpaired Video Captioning with Visual
Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module
(VIM), which aligns source visual and target language domains to inject the
source visual information into the target language domain. Meanwhile, VIM
directly connects the encoder of the video-to-pivot model and the decoder of
the pivot-to-target model, allowing end-to-end inference by completely skipping
the generation of pivot captions. To enhance the cross-modality injection of
the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal
Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms
pipeline systems and exceeds several supervised systems. Furthermore, equipping
existing supervised systems with our MCE can achieve relative margins of 4% and
7% in CIDEr score over current state-of-the-art models on the benchmark MSVD
and MSR-VTT datasets, respectively.
Comment: Published at IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI)
Prophet Attention: Predicting Attention with Future Attention for Image Captioning
Recently, attention-based models have been used extensively in many
sequence-to-sequence learning systems. In image captioning especially,
attention-based models are expected to ground the correct image regions with
the proper generated words. However, at each time step of the decoding process,
attention-based models usually use the hidden state of the current input to
attend to the image regions. Under this setting, these attention models have a
"deviated focus" problem: they calculate the attention weights based on
previous words instead of the one to be generated, impairing the performance of
both grounding and captioning. In this paper, we propose Prophet Attention,
which is similar in form to self-supervision. In the training stage, this module
utilizes the future information to calculate the "ideal" attention weights
towards image regions. These calculated "ideal" weights are further used to
regularize the "deviated" attention. In this manner, image regions are grounded
with the correct words. The proposed Prophet Attention can be easily
incorporated into existing image captioning models to improve their performance
of both grounding and captioning. The experiments on the Flickr30k Entities and
the MSCOCO datasets show that the proposed Prophet Attention consistently
outperforms baselines in both automatic metrics and human evaluations. Notably,
we set a new state of the art on both benchmark datasets
and achieve 1st place on the leaderboard of the online MSCOCO benchmark in
terms of the default ranking score, i.e., CIDEr-c40.
Comment: Accepted by NeurIPS 202
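A rough sketch of the regularization described above: one attention distribution is computed from the current-input state (the "deviated" one), another from a representation that has seen the word to be generated (the "ideal" one), and their distance is used as a training penalty. The function names and the choice of an L1 distance are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def attention_weights(query, regions):
    # Scaled dot-product attention weights of one query over image regions.
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, region)) / scale
              for region in regions]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prophet_regularizer(current_state, future_state, regions):
    # "Deviated" weights: based on the current input (previous words).
    deviated = attention_weights(current_state, regions)
    # "Ideal" weights: based on future information, available only in training.
    ideal = attention_weights(future_state, regions)
    # ASSUMED distance: L1 between the two distributions.
    return sum(abs(d - i) for d, i in zip(deviated, ideal))

regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
penalty = prophet_regularizer([1.0, 0.0], [0.0, 1.0], regions)
```

The penalty is zero when both states produce the same weights, so at inference time, where no future word exists, the module can simply be dropped.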
Contrastive Attention for Automatic Chest X-ray Report Generation
Recently, chest X-ray report generation, which aims to automatically generate
descriptions of given chest X-ray images, has received growing research
interest. The key challenge of chest X-ray report generation is to accurately
capture and describe the abnormal regions. In most cases, the normal regions
dominate the entire chest X-ray image, and the corresponding descriptions of
these normal regions dominate the final report. Due to such data bias,
learning-based models may fail to attend to abnormal regions. In this work, to
effectively capture and describe abnormal regions, we propose the Contrastive
Attention (CA) model. Instead of solely focusing on the current input image,
the CA model compares the current input image with normal images to distill the
contrastive information. The acquired contrastive information can better
represent the visual features of abnormal regions. According to the experiments
on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into
several existing models can boost their performance across most metrics. In
addition, according to the analysis, the CA model can help existing models
better attend to the abnormal regions and provide more accurate descriptions,
which are crucial for an interpretable diagnosis. Specifically, we achieve
state-of-the-art results on both public datasets.
Comment: Appears in Findings of ACL 2021 (The Joint Conference of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (ACL-IJCNLP
2021))
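The comparison against normal images can be sketched as follows. This is only an illustration under stated assumptions: the pool of normal-image features, the attention-weighted summary of that pool, and the subtraction used to form the contrastive feature are hypothetical choices, and the function name is not from the paper.

```python
import math

def contrastive_feature(image_feature, normal_pool):
    # Attend from the input image feature over a pool of normal-image features.
    scale = math.sqrt(len(image_feature))
    scores = [sum(a * b for a, b in zip(image_feature, normal)) / scale
              for normal in normal_pool]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted summary of what the input shares with normal images.
    shared = [sum(w * normal[d] for w, normal in zip(weights, normal_pool))
              for d in range(len(image_feature))]
    # ASSUMED contrast: remove the shared part, keeping what differs.
    return [x - s for x, s in zip(image_feature, shared)]

normal_pool = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
contrast = contrastive_feature([1.0, 0.0, 3.0], normal_pool)
```

In this toy case the third component is the only one that survives, standing in for a region that differs from the normal images.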
Research on Acquisition and Communication Technology of Marine Equipment Fault Diagnosis Information for Vessels in Inland Rivers
Abstract: This study describes a system for fault diagnosis of marine equipment on vessels in inland rivers. Because the normal operation of marine equipment significantly influences ship safety, this study develops an efficient fault diagnosis information system for condition monitoring and fault diagnosis of marine equipment. In the new system, field bus and ship-to-shore communication technologies are integrated for fault diagnosis information acquisition. Network buses, including CAN and RS485, are employed to connect the fault diagnosis information to the Ethernet on the ship. Lastly, for real-time wireless transmission of the fault information, Automatic Identification System (AIS) technology is adopted to provide accurate and reliable transmission of fault diagnosis information from ships to the onshore diagnosis center. A comprehensive study applying the proposed fault diagnosis information system to remote diagnosis of marine equipment has been carried out. The analysis results demonstrate that the newly developed system can enhance fault diagnosis precision and hence is suitable for condition monitoring and fault diagnosis of marine equipment in inland rivers.