A Cross-Modal Image Fusion Method Guided by Human Visual Characteristics
Feature selection, nonlinear combination, and multi-task auxiliary learning are
characteristics of the human visual perception system that play an important
role in real-world scenarios, yet image fusion theory grounded in these
characteristics remains underexplored. Inspired by human visual perception, we
propose a robust image fusion framework optimized through multi-task auxiliary
learning. First, we combine a channel attention model with a nonlinear
convolutional neural network to select features and fuse them nonlinearly. We
then analyze the impact of existing image fusion losses on fusion quality and
establish a multi-loss function model for an unsupervised learning network.
Second, targeting the multi-task auxiliary learning mechanism of the human
visual perception system, we study the influence of multi-task auxiliary
learning on the image fusion task, building on the single-task multi-loss
network model. By simulating these three characteristics of human visual
perception, the fused image becomes more consistent with the way the human
brain fuses images. Finally, to verify the superiority of our algorithm, we
carry out experiments on a combined vision system image dataset and extend the
evaluation to public infrared-visible and multi-focus image datasets. The
experimental results demonstrate that our fusion theory surpasses the state of
the art in generality and robustness.
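The first step above pairs channel attention with a nonlinear CNN to select and fuse features. Below is a minimal PyTorch sketch of that pattern; the module name, dimensions, and the SE-style attention layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Illustrative channel-attention fusion: concatenate features from two
    source images, reweight channels (feature selection), then fuse them
    with a nonlinear convolution. Hyperparameters are assumptions."""
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global context per channel
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),                     # per-channel selection weights
        )
        self.fuse = nn.Sequential(            # nonlinear combination
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_a, feat_b], dim=1)
        return self.fuse(x * self.attn(x))    # select channels, then fuse

# fa, fb = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
# fused = ChannelAttentionFusion()(fa, fb)   # -> (1, 64, 32, 32)
```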
MMFNet: A Multi-modality MRI Fusion Network for Segmentation of Nasopharyngeal Carcinoma
Segmentation of nasopharyngeal carcinoma (NPC) from Magnetic Resonance Images
(MRI) is a crucial prerequisite for NPC radiotherapy. However, manual
segmentation of NPC is time-consuming and labor-intensive. Additionally,
single-modality MRI generally cannot provide enough information for its
accurate delineation. Therefore, a multi-modality MRI fusion network (MMFNet)
based on three modalities of MRI (T1, T2 and contrast-enhanced T1) is proposed
to complete accurate segmentation of NPC. The backbone of MMFNet is designed as
a multi-encoder-based network, consisting of several encoders to capture
modality-specific features and one single decoder to fuse them and obtain
high-level features for NPC segmentation. A fusion block is presented to
effectively fuse features from multi-modality MRI. It first recalibrates the
low-level features captured by the modality-specific encoders to highlight
both informative features and regions of interest, then fuses the weighted
features through a residual fusion block to balance them against the
high-level features from the decoder. Moreover, a training strategy named
self-transfer, which uses pre-trained modality-specific encoders to initialize
the multi-encoder network, is proposed to fully mine the information in the
different MRI modalities. The proposed multi-modality MRI method effectively
segments NPC, and its advantages are validated by extensive experiments.
Comment: 34 pages, 12 figures
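A compact sketch of the multi-encoder/single-decoder pattern described above, with one encoder per MRI modality and a fusion block that recalibrates and residually merges the modality-specific features. The layer sizes, single-stage encoders, and exact recalibration rule are simplifying assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Recalibrate concatenated modality-specific features, then fuse them
    through a residual path (a loose sketch of the abstract's fusion block)."""
    def __init__(self, channels: int, n_modalities: int = 3):
        super().__init__()
        c = channels * n_modalities
        self.recalibrate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Conv3d(c, c, 1), nn.Sigmoid()
        )
        self.fuse = nn.Conv3d(c, channels, 3, padding=1)

    def forward(self, feats):                  # list of (B, C, D, H, W)
        x = torch.cat(feats, dim=1)
        x = x * self.recalibrate(x)            # highlight informative features
        return self.fuse(x) + sum(feats)       # residual fusion

class MultiEncoderNet(nn.Module):
    """One encoder per modality (T1, T2, contrast-enhanced T1), one decoder."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Conv3d(1, channels, 3, padding=1) for _ in range(3)
        )
        self.fusion = FusionBlock(channels)
        self.decoder = nn.Conv3d(channels, 1, 1)   # per-voxel NPC mask logits

    def forward(self, t1, t2, t1c):
        feats = [enc(m) for enc, m in zip(self.encoders, (t1, t2, t1c))]
        return self.decoder(self.fusion(feats))
```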
Deep Co-attention based Comparators For Relative Representation Learning in Person Re-identification
Person re-identification (re-ID) requires rapid, flexible yet discriminant
representations to quickly generalize to unseen observations on-the-fly and
recognize the same identity across disjoint camera views. Recent effective
methods are built on pair-wise similarity learning systems that detect a fixed
set of features from distinct regions and map them to vector embeddings for
distance measurement. However, the most relevant and crucial parts of each
image are detected independently, without modeling the dependency between the
two images. Also, these region-based methods rely on spatial manipulation to
align local features for comparable similarity measurement. To overcome these
limitations, in this paper we introduce the Deep Co-attention based Comparators
(DCCs), which fuse the co-dependent representations of the paired images so as
to focus on the relevant parts of both images and produce their
\textit{relative representations}. Given a pair of pedestrian images to be
compared, the proposed model mimics the foveation of human eyes to detect
distinct regions concurrently on both images, namely co-dependent features,
and alternately attends to relevant regions to fuse them into the similarity
learning. Our comparator produces dynamic representations relative to a
particular sample each time, and is thus well suited to re-identifying
pedestrians on-the-fly. We perform extensive experiments to provide insights
into and demonstrate the effectiveness of the proposed DCCs in person re-ID.
Moreover, our approach achieves state-of-the-art performance on three
benchmark datasets: DukeMTMC-reID \cite{DukeMTMC}, CUHK03 \cite{FPNN}, and
Market-1501 \cite{Market1501}.
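The core mechanism is attention computed jointly over both images of a pair. Below is a toy sketch of such co-attention over flattened spatial features; the affinity-matrix formulation and the concatenation used for the relative representation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Toy co-attention between two pedestrian feature maps: an affinity
    matrix relates every location in one image to every location in the
    other, and each image attends to the other's relevant regions."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, fa: torch.Tensor, fb: torch.Tensor):
        # fa, fb: (B, N, dim) flattened spatial features of the two images
        affinity = self.proj(fa) @ fb.transpose(1, 2)           # (B, Na, Nb)
        a2b = F.softmax(affinity, dim=2) @ fb                   # fa attends to fb
        b2a = F.softmax(affinity, dim=1).transpose(1, 2) @ fa   # fb attends to fa
        # relative representations: each image described w.r.t. the other
        return torch.cat([fa, a2b], dim=-1), torch.cat([fb, b2a], dim=-1)
```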
A Symmetric Encoder-Decoder with Residual Block for Infrared and Visible Image Fusion
In computer vision and image processing tasks, image fusion has evolved into
an attractive research field. However, many existing image fusion methods are
built on pixel-level operations, which may produce unacceptable artifacts and
are time-consuming. In this paper, a symmetric encoder-decoder with a residual
block (SEDR) for infrared and visible image fusion is proposed. In the
training stage, the SEDR network is trained on a new dataset to obtain a fixed
feature extractor. In the fusion stage, the trained model is first used to
extract the intermediate features and compensation features of the two source
images. The extracted intermediate features are then used to generate two
attention maps, which are multiplied with the input features for refinement.
In addition, the compensation features generated by the first two
convolutional layers are merged and passed to the corresponding
deconvolutional layers. Finally, the refined features are fused and decoded to
reconstruct the final fused image. Experimental results demonstrate that the
proposed fusion method (named SEDRFuse) outperforms state-of-the-art fusion
methods in terms of both subjective and objective evaluations.
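A tiny sketch of the attention-map refinement step described above: each source's intermediate features yield a spatial activity map, the two maps are normalized against each other, and the features are reweighted and merged. The l1-style activity measure is an assumption for illustration, not the paper's exact rule.

```python
import torch

def attention_fuse(feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
    """Derive a spatial activity map from each source's intermediate
    features, normalize the two maps against each other, and use them to
    reweight and merge the features (illustrative attention-map fusion)."""
    act_ir = feat_ir.abs().sum(dim=1, keepdim=True)     # (B, 1, H, W)
    act_vis = feat_vis.abs().sum(dim=1, keepdim=True)
    total = act_ir + act_vis + 1e-8
    w_ir, w_vis = act_ir / total, act_vis / total       # competing attention maps
    return w_ir * feat_ir + w_vis * feat_vis            # refined, fused features

# fused = attention_fuse(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```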
Multiresolution and Multimodal Speech Recognition with Transformers
This paper presents an audio-visual automatic speech recognition (AV-ASR)
system using a Transformer-based architecture. We particularly focus on the
scene context provided by the visual information, to ground the ASR. We extract
representations for audio features in the encoder layers of the transformer and
fuse video features using an additional crossmodal multihead attention layer.
Additionally, we incorporate a multitask training criterion for multiresolution
ASR, where we train the model to generate both character and subword level
transcriptions.
Experimental results on the How2 dataset indicate that multiresolution
training can speed up convergence by around 50% and improves word error rate
(WER) by up to 18% relative over subword prediction models. Further,
incorporating visual information improves performance, with relative gains of
up to 3.76% over audio-only models. Our results are comparable to
state-of-the-art Listen, Attend and Spell-based architectures.
Comment: Accepted for ACL 2020
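The visual stream is fused into the ASR encoder through an extra crossmodal multihead attention layer, as described above. Here is a minimal sketch of that layer, with audio encoder states as queries over video features; the dimensions and the residual/LayerNorm arrangement are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of a crossmodal attention layer: audio encoder states act as
    queries over video features, and the attended video context is added
    back to the audio stream."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, dim) encoder states; video: (B, Tv, dim) visual features
        context, _ = self.cross_attn(query=audio, key=video, value=video)
        return self.norm(audio + context)   # residual fusion into the ASR stream
```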
Selective Kernel Networks
In standard Convolutional Neural Networks (CNNs), the receptive fields of
artificial neurons in each layer are designed to share the same size. It is
well-known in the neuroscience community that the receptive field size of
visual cortical neurons is modulated by the stimulus, yet this has rarely been
considered in constructing CNNs. We propose a dynamic selection mechanism in
CNNs that allows each neuron to adaptively adjust its receptive field size
based on multiple scales of input information. A building block called
Selective Kernel (SK) unit is designed, in which multiple branches with
different kernel sizes are fused using softmax attention that is guided by the
information in these branches. Different attentions on these branches yield
different sizes of the effective receptive fields of neurons in the fusion
layer. Multiple SK units are stacked into a deep network termed Selective Kernel
Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show
that SKNet outperforms the existing state-of-the-art architectures with lower
model complexity. Detailed analyses show that the neurons in SKNet can capture
target objects with different scales, which verifies the capability of neurons
for adaptively adjusting their receptive field sizes according to the input.
The code and models are available at https://github.com/implus/SKNet.
Comment: CVPR 2019
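A simplified sketch of the SK unit's split-fuse-select pattern described above: branches with different receptive fields are summed, a global descriptor produces per-branch logits, and a softmax across branches reweights each branch per channel. Branch count and reduction ratio here are illustrative choices.

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    """Simplified Selective Kernel unit: two branches with different
    receptive fields (3x3, and 5x5 via dilation) are fused, and a softmax
    attention over branches selects per-channel kernel sizes."""
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        d = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        self.select = nn.Linear(d, 2 * channels)   # logits for the 2 branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b3, b5 = self.branch3(x), self.branch5(x)
        s = (b3 + b5).mean(dim=(2, 3))             # fuse: global average pool
        logits = self.select(self.squeeze(s)).view(-1, 2, b3.size(1))
        w = torch.softmax(logits, dim=1)           # softmax across branches
        return w[:, 0, :, None, None] * b3 + w[:, 1, :, None, None] * b5
```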
Question Guided Modular Routing Networks for Visual Question Answering
This paper studies the task of Visual Question Answering (VQA), which has
recently become topical in the multimedia community. In particular, we explore
two critical research problems in VQA: (1) efficiently fusing the visual and
textual modalities; (2) enabling the visual reasoning ability of VQA models
when answering complex questions. To address these challenging problems, a
novel architecture, Question Guided Modular Routing Networks (QGMRN), is
proposed in this paper. The QGMRN is composed of a visual network, a textual
network, and a routing network. The visual and textual networks serve as
backbones, acting as generic feature extractors for the two modalities, and
QGMRN can fuse the visual and textual modalities at multiple semantic levels.
Visual reasoning is facilitated by the routing network in a discrete and
stochastic way, using the Gumbel-Softmax trick for module selection. When the
input reaches a certain modular layer, the routing network, newly proposed in
this paper, dynamically selects a portion of the modules in that layer to
process the input, depending on the question features generated by the textual
network. It can also learn to reason by routing between the generic modules
without additional supervision or expert knowledge. Benefiting from the
dynamic routing mechanism, QGMRN outperforms previous classical VQA methods by
a large margin and achieves competitive results against state-of-the-art
methods. Furthermore, an attention mechanism is integrated into the QGMRN
model to further boost performance. Empirically, extensive experiments on the
CLEVR and CLEVR-Humans datasets validate the effectiveness of the proposed
model, which achieves state-of-the-art performance.
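The routing step is the distinctive piece: question features produce logits over a layer's modules, and the Gumbel-Softmax trick makes the discrete selection differentiable. A loose sketch under assumed shapes; hard one-hot selection is used here for simplicity, whereas the paper selects a portion of the modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedLayer(nn.Module):
    """Sketch of question-guided module routing: question features produce
    per-module logits, Gumbel-Softmax gives a discrete-but-differentiable
    selection, and the selected module's output is passed on."""
    def __init__(self, channels: int = 128, q_dim: int = 256, n_modules: int = 4):
        super().__init__()
        self.modules_ = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(n_modules)
        )
        self.router = nn.Linear(q_dim, n_modules)   # question -> routing logits

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) visual features; q: (B, q_dim) question features
        gate = F.gumbel_softmax(self.router(q), tau=1.0, hard=True)  # (B, M)
        outs = torch.stack([m(x) for m in self.modules_], dim=1)     # (B, M, C, H, W)
        return (gate[:, :, None, None, None] * outs).sum(dim=1)
```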
Natural Language Inference over Interaction Space
The Natural Language Inference (NLI) task requires an agent to determine the
logical relationship between a natural language premise and a natural language
hypothesis. We introduce Interactive Inference Network (IIN), a novel class of
neural network architectures that is able to achieve high-level understanding
of the sentence pair by hierarchically extracting semantic features from
interaction space. We show that an interaction tensor (attention weight)
contains semantic information to solve natural language inference, and a denser
interaction tensor contains richer semantic information. One instance of such
an architecture, the Densely Interactive Inference Network (DIIN),
demonstrates state-of-the-art performance on large-scale NLI corpora and an
NLI-like corpus. Notably, DIIN achieves a greater than 20% error reduction on
the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the
strongest published system.
Comment: 15 pages, 2 figures. Published at the Sixth International Conference
on Learning Representations, ICLR 2018
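The central object is the interaction tensor: every premise position interacts with every hypothesis position, and a CNN extracts semantic features from that dense "image". A sketch under assumed shapes; the elementwise-product interaction and single conv layer are a simplification of the full architecture.

```python
import torch
import torch.nn as nn

class InteractionTensor(nn.Module):
    """Sketch of an interaction tensor: pairwise interactions between
    premise and hypothesis encodings form a dense grid that a CNN can
    hierarchically extract semantic features from."""
    def __init__(self, dim: int = 300, hidden: int = 64):
        super().__init__()
        self.extract = nn.Conv2d(dim, hidden, 3, padding=1)  # CNN over interaction space

    def forward(self, premise: torch.Tensor, hypothesis: torch.Tensor) -> torch.Tensor:
        # premise: (B, P, dim); hypothesis: (B, H, dim)
        inter = premise.unsqueeze(2) * hypothesis.unsqueeze(1)  # (B, P, H, dim)
        return self.extract(inter.permute(0, 3, 1, 2))          # (B, hidden, P, H)
```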
Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
Text in natural images carries rich semantics that are often highly relevant
to the depicted objects or scene. In this paper, we focus on the problem of fully exploiting
scene text for visual understanding. The main idea is combining word
representations and deep visual features into a globally trainable deep
convolutional neural network. First, the recognized words are obtained by a
scene text reading system. Then, we combine the word embedding of the
recognized words and the deep visual features into a single representation,
which is optimized by a convolutional neural network for fine-grained image
classification. In our framework, an attention mechanism is adopted to reveal
the relevance between each recognized word and the given image, which further
enhances the recognition performance. We have performed experiments on two
datasets, the Con-Text dataset and the Drink Bottle dataset, which were
proposed for fine-grained classification of business places and drink bottles,
respectively. The experimental results consistently demonstrate that the
proposed method, which combines textual and visual cues, significantly
outperforms classification with only visual representations. Moreover, we have
shown that the learned representation improves retrieval performance on the
drink bottle images by a large margin, making it potentially useful in product
search.
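A sketch of the attention step described above: recognized-word embeddings are scored for relevance against the global visual feature, pooled by those weights, and concatenated with the visual feature for classification. The bilinear scoring function and all dimensions are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVisualFusion(nn.Module):
    """Score each recognized word's embedding against the global visual
    feature, pool the words with the resulting attention weights, and
    classify the concatenated text-visual representation."""
    def __init__(self, word_dim: int = 300, vis_dim: int = 2048, n_classes: int = 28):
        super().__init__()
        self.score = nn.Bilinear(word_dim, vis_dim, 1)   # word-image relevance
        self.classify = nn.Linear(word_dim + vis_dim, n_classes)

    def forward(self, words: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # words: (B, W, word_dim) embeddings of recognized words; visual: (B, vis_dim)
        v = visual.unsqueeze(1).expand(-1, words.size(1), -1).contiguous()
        attn = F.softmax(self.score(words, v).squeeze(-1), dim=1)   # (B, W)
        text = (attn.unsqueeze(-1) * words).sum(dim=1)              # weighted word vector
        return self.classify(torch.cat([text, visual], dim=-1))
```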
Modality Attention for End-to-End Audio-visual Speech Recognition
Audio-visual speech recognition (AVSR) is considered one of the most promising
approaches to robust speech recognition, especially in noisy environments. In
this paper, we propose a novel multimodal attention based method for
audio-visual speech recognition that automatically learns a fused
representation from both modalities based on their importance. Our method is
realized with state-of-the-art sequence-to-sequence (Seq2seq) architectures.
Experimental results show relative improvements of 2% up to 36% over the
auditory modality alone, depending on the signal-to-noise ratio (SNR).
Compared to traditional feature concatenation methods, our proposed approach
achieves better recognition performance under both clean and noisy conditions.
We believe this modality attention based end-to-end method can be easily
generalized to other multimodal tasks with correlated information.
Comment: accepted by ICASSP 2019
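A minimal sketch of modality-level attention as described above: at each time step, each modality's feature vector receives a score, a softmax turns the scores into importance weights, and the fused representation is the weighted sum. Shapes and the scoring function are assumptions.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Per-time-step attention over modalities: score each modality,
    normalize with softmax, and fuse by weighted sum."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (B, T, dim) aligned per-frame features of each modality
        stacked = torch.stack([audio, video], dim=2)          # (B, T, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=2)   # importance per modality
        return (weights * stacked).sum(dim=2)                 # (B, T, dim) fused
```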