Improving End-to-End Text Image Translation From the Auxiliary Text Translation Task
End-to-end text image translation (TIT), which aims at translating the source
language embedded in images to the target language, has attracted intensive
attention in recent research. However, data sparsity limits the performance of
end-to-end text image translation. Multi-task learning is an effective way to
alleviate this problem by exploiting knowledge from complementary related
tasks. In this paper, we propose a novel text-translation-enhanced text image
translation model, which trains the end-to-end model with text translation as an
auxiliary task. By sharing model parameters and multi-task training, our model
is able to take full advantage of easily available large-scale parallel text
corpora. Extensive experimental results show that our proposed method outperforms
existing end-to-end methods, and the joint multi-task learning with both text
translation and recognition tasks achieves better results, proving translation
and recognition auxiliary tasks are complementary.
Comment: Accepted at the 26th International Conference on Pattern Recognition
(ICPR 2022).
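The joint objective described above can be sketched as a weighted sum of the main text-image-translation (TIT) loss and the auxiliary text-translation loss on the shared decoder. The weighting scheme and the `aux_weight` hyperparameter below are illustrative assumptions, not the paper's exact formulation.

```python
def multitask_loss(tit_loss, mt_loss, aux_weight=0.5):
    """Combine the main TIT loss with the auxiliary machine-translation
    (MT) loss computed on the shared decoder. `aux_weight` is a
    hypothetical hyperparameter, not taken from the paper."""
    return tit_loss + aux_weight * mt_loss

# Example: a TIT batch loss of 1.8 and an auxiliary MT batch loss of 1.2
joint = multitask_loss(1.8, 1.2, aux_weight=0.5)
```

In practice the two losses would come from alternating mini-batches drawn from the (scarce) image-parallel corpus and the (abundant) text-parallel corpus, with the decoder parameters shared between both passes.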
Fusion of Multispectral Data Through Illumination-aware Deep Neural Networks for Pedestrian Detection
Multispectral pedestrian detection has received extensive attention in recent
years as a promising solution to facilitate robust human target detection for
around-the-clock applications (e.g. security surveillance and autonomous
driving). In this paper, we demonstrate that illumination information encoded in
multispectral images can be utilized to significantly boost the performance of
pedestrian detection. A novel illumination-aware weighting mechanism is presented
to accurately characterize the illumination condition of a scene. Such illumination
information is incorporated into two-stream deep convolutional neural networks
to learn multispectral human-related features under different illumination
conditions (daytime and nighttime). Moreover, we utilize illumination
information together with multispectral data to generate more accurate semantic
segmentation masks, which are used to boost pedestrian detection accuracy. Putting all
of the pieces together, we present a powerful framework for multispectral
pedestrian detection based on multi-task learning of illumination-aware
pedestrian detection and semantic segmentation. Our proposed method is trained
end-to-end using a well-designed multi-task loss function and outperforms
state-of-the-art approaches on the KAIST multispectral pedestrian dataset.
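The illumination-aware weighting described above can be sketched as a gate that mixes the two stream features according to a daytime-confidence score. The functions below are a minimal pure-Python stand-in (flat lists instead of feature tensors); the gating form and names are assumptions, not the paper's implementation.

```python
def illumination_weights(day_confidence):
    """Map an illumination score in [0, 1] (from a hypothetical
    illumination sub-network) to per-stream fusion weights: the RGB
    stream dominates in daytime, the thermal stream at night."""
    return day_confidence, 1.0 - day_confidence

def fuse_streams(rgb_feat, thermal_feat, day_confidence):
    """Weighted fusion of two-stream features, given here as flat lists
    of the same length."""
    w_rgb, w_th = illumination_weights(day_confidence)
    return [w_rgb * r + w_th * t for r, t in zip(rgb_feat, thermal_feat)]
```

In the full framework, the fused features would feed both the detection head and the semantic segmentation head, trained jointly with a multi-task loss.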
Discovering Discriminative Geometric Features with Self-Supervised Attention for Vehicle Re-Identification and Beyond
In the literature of vehicle re-identification (ReID), intensive manual
labels such as landmarks, critical parts or semantic segmentation masks are
often required to improve the performance. Such extra information helps to
detect locally geometric features as a part of representation learning for
vehicles. In contrast, in this paper, we aim to address the challenge of
automatically learning to detect geometric features as landmarks with no
extra labels. To the best of our knowledge, we are the first to
successfully learn discriminative geometric features for vehicle ReID based on
self-supervised attention. Specifically, we implement an end-to-end trainable
deep network architecture consisting of three branches: (1) a global branch as
backbone for image feature extraction, (2) an attentional branch for producing
attention masks, and (3) a self-supervised branch for regularizing the
attention learning with rotated images to locate geometric features.
We conduct comprehensive experiments on three benchmark datasets for vehicle
ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID, and demonstrate our
state-of-the-art performance. We also show the
good generalization of our approach in other ReID tasks such as person ReID and
multi-target multi-camera (MTMC) vehicle tracking. Our demo code is
attached in the supplementary file.
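The self-supervised branch above regularizes attention with rotated inputs. One common rotation-prediction pretext task of that kind can be sketched as follows; the pure-Python grid functions are illustrative stand-ins for tensor ops, not the paper's implementation.

```python
def rotate90(img):
    """Rotate an H x W grid 90 degrees clockwise (a pure-Python
    stand-in for an image rotation op)."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img, k):
    """Self-supervised pretext example: rotate the input k * 90 degrees
    and use k as the label. A branch trained to predict k is pushed to
    attend to orientation-sensitive geometric structure, which is the
    regularization effect the abstract describes."""
    rotated = img
    for _ in range(k % 4):
        rotated = rotate90(rotated)
    return rotated, k % 4
```

Such examples cost no manual annotation, which is what lets the attention masks be learned with no extra labels.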
Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation
End-to-end speech translation, a hot topic in recent years, aims to translate
a segment of audio into a specific language with an end-to-end model.
Conventional approaches employ multi-task learning and pre-training methods for
this task, but they suffer from the huge gap between pre-training and
fine-tuning. To address this issue, we propose a Tandem Connectionist
Encoding Network (TCEN) which bridges the gap by reusing all subnets in
fine-tuning, keeping the roles of subnets consistent, and pre-training the
attention module. Furthermore, we propose two simple but effective methods to
guarantee the speech encoder outputs and the MT encoder inputs are consistent
in terms of semantic representation and sequence length. Experimental results
show that our model outperforms baselines by 2.2 BLEU on a large benchmark
dataset.
Comment: AAAI202
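One common way to make frame-level speech-encoder outputs match text-like sequence lengths, as the length-consistency goal above requires, is CTC-style shrinking: collapse consecutive repeats and drop blanks. The sketch below is an assumed illustration of that general mechanism, not necessarily the paper's exact method.

```python
def ctc_shrink(frame_labels, blank="<blank>"):
    """Collapse consecutive repeated labels and drop blank symbols,
    shortening a frame-level label sequence toward a text-like length
    before it is fed to the MT encoder."""
    shrunk, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            shrunk.append(lab)
        prev = lab
    return shrunk

# 6 speech frames collapse to a 2-token, text-like sequence
tokens = ctc_shrink(["hi", "hi", "<blank>", "<blank>", "there", "there"])
```

The shrunk sequence then has roughly the length and granularity the pre-trained MT encoder expects, narrowing the pre-training/fine-tuning gap the abstract describes.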