Representation learning for cross-modality classification
Differences in scanning parameters or modalities can complicate image analysis based on supervised classification. This paper presents two representation learning approaches, based on autoencoders, that address this problem by learning representations that are similar across domains. In addition to the standard data representation objective, both approaches use a similarity objective that minimises the difference between representations of corresponding patches from each domain. We evaluated the methods in transfer learning experiments on multi-modal brain MRI data and on synthetic data. After transforming training and test data from different modalities to the common representations learned by our methods, we trained classifiers for each pair of modalities. We found that adding the similarity term to the standard objective can produce representations that are more similar across modalities and can yield higher accuracy in these cross-modality classification experiments.
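
The core idea can be sketched in a few lines of PyTorch (a minimal sketch; all names and hyperparameters here are hypothetical, since the abstract does not specify the exact architecture or loss weighting): two per-modality autoencoders trained with reconstruction losses plus an L2 similarity term between the latent codes of corresponding patches.

    import torch
    import torch.nn as nn

    class PatchAutoencoder(nn.Module):
        def __init__(self, dim_in=256, dim_latent=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                     nn.Linear(128, dim_latent))
            self.dec = nn.Sequential(nn.Linear(dim_latent, 128), nn.ReLU(),
                                     nn.Linear(128, dim_in))

        def forward(self, x):
            z = self.enc(x)
            return z, self.dec(z)

    ae_a, ae_b = PatchAutoencoder(), PatchAutoencoder()
    opt = torch.optim.Adam(list(ae_a.parameters()) + list(ae_b.parameters()), lr=1e-3)
    mse = nn.MSELoss()
    lam = 1.0  # weight of the similarity term (a free hyperparameter)

    def training_step(x_a, x_b):
        # x_a, x_b: corresponding patches from modality A and modality B
        z_a, rec_a = ae_a(x_a)
        z_b, rec_b = ae_b(x_b)
        # reconstruction in each modality plus similarity between latent codes
        loss = mse(rec_a, x_a) + mse(rec_b, x_b) + lam * mse(z_a, z_b)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()
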
Learning Cross-Modality Representations from Multi-Modal Images
Machine learning algorithms can have difficulties adapting to data from different sources, for example from different imaging modalities. We present and analyze three techniques for unsupervised cross-modality feature learning, using a shared autoencoder-like convolutional network that learns a common representation from multi-modal data. We investigate a form of feature normalization, a learning objective that minimizes cross-modality differences, and modality dropout, in which the network is trained with varying subsets of modalities. We measure the same-modality and cross-modality classification accuracies and explore whether the models learn modality-specific or shared features. This paper presents experiments on two public datasets: knee images from two MRI modalities, provided by the Osteoarthritis Initiative, and brain tumor segmentation on four MRI modalities from the BRATS challenge. All three approaches improved the cross-modality classification accuracy, with modality dropout and per-feature normalization giving the largest improvement. We observed that the networks tend to learn a combination of cross-modality and modality-specific features. Overall, a combination of all three methods produced the most cross-modality features and the highest cross-modality classification accuracy, while maintaining most of the same-modality accuracy.
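
A rough sketch of two of the three techniques, modality dropout and per-feature normalization, under assumed simplifications (linear layers instead of the paper's convolutional network; all names are hypothetical, and the interpretation of "per-feature normalization" as batch normalization is an assumption):

    import torch
    import torch.nn as nn

    def modality_dropout(mods, p=0.5):
        # mods: list of per-modality feature tensors [B, D]; zero out a random
        # subset of modalities, but always keep at least one.
        keep = torch.rand(len(mods)) > p
        if not keep.any():
            keep[torch.randint(len(mods), (1,))] = True
        return [m if k else torch.zeros_like(m) for m, k in zip(mods, keep)]

    class SharedEncoder(nn.Module):
        def __init__(self, dim_in=256, dim_latent=64):
            super().__init__()
            self.net = nn.Linear(dim_in, dim_latent)
            # per-feature normalization of the learned representation
            self.norm = nn.BatchNorm1d(dim_latent, affine=False)

        def forward(self, mods, drop=True):
            if drop:
                mods = modality_dropout(mods)
            return self.norm(self.net(torch.cat(mods, dim=-1)))

    # cross-modality similarity objective (the third technique): encode each
    # modality alone and penalize the difference between the representations
    def similarity_loss(enc, x_a, x_b):
        z_a = enc([x_a, torch.zeros_like(x_b)], drop=False)
        z_b = enc([torch.zeros_like(x_a), x_b], drop=False)
        return (z_a - z_b).pow(2).mean()
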
RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning
Existing Transformer-based RGBT tracking methods either use cross-attention
to fuse the two modalities, or use self-attention and cross-attention to model
both modality-specific and modality-sharing information. However, the
significant appearance gap between modalities limits the feature representation
ability of certain modalities during the fusion process. To address this
problem, we propose a novel Progressive Fusion Transformer called ProFormer,
which progressively integrates single-modality information into the multimodal
representation for robust RGBT tracking. In particular, ProFormer first uses a
self-attention module to collaboratively extract the multimodal representation,
and then uses two cross-attention modules to interact it with the features of
the two modalities respectively. In this way, the modality-specific
information is effectively activated in the multimodal representation. Finally, a
feed-forward network is used to fuse two interacted multimodal representations
for the further enhancement of the final multimodal representation. In
addition, existing learning methods for RGBT trackers either fuse the multimodal
features into one for final classification, or exploit the relationship between
the unimodal branches and the fused branch through a competitive learning
strategy. However, they either neglect the learning of the single-modality
branches or leave one branch poorly optimized. To solve these problems, we propose
a dynamically guided learning algorithm that adaptively uses well-performing
branches to guide the learning of other branches, for enhancing the
representation ability of each branch. Extensive experiments demonstrate that
our proposed ProFormer sets a new state of the art on the RGBT210,
RGBT234, LasHeR, and VTUAV datasets.
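
The described fusion pattern might be sketched as follows (a simplified, hypothetical rendering, not the actual ProFormer implementation): self-attention over the concatenated modality tokens, two cross-attention modules that let the joint representation attend back to each modality, and a feed-forward network that fuses the two interactions.

    import torch
    import torch.nn as nn

    class ProgressiveFusionBlock(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_tir = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))

        def forward(self, rgb, tir):
            # rgb, tir: [B, N, dim] token sequences from the two modalities
            multi = torch.cat([rgb, tir], dim=1)
            multi, _ = self.self_attn(multi, multi, multi)  # joint representation
            m_rgb, _ = self.cross_rgb(multi, rgb, rgb)      # interact with RGB
            m_tir, _ = self.cross_tir(multi, tir, tir)      # interact with thermal
            # fuse the two interacted multimodal representations
            return self.ffn(torch.cat([m_rgb, m_tir], dim=-1))
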
PRIOR: Prototype Representation Joint Learning from Medical Images and Reports
Contrastive learning based vision-language joint pre-training has emerged as
a successful representation learning strategy. In this paper, we present a
prototype representation learning framework incorporating both global and local
alignment between medical images and reports. In contrast to standard global
multi-modality alignment methods, we employ a local alignment module for
fine-grained representation. Furthermore, a cross-modality conditional
reconstruction module is designed to interchange information across modalities
in the training phase by reconstructing masked images and reports. For
reconstructing long reports, a sentence-wise prototype memory bank is
constructed, enabling the network to focus on low-level localized visual and
high-level clinical linguistic features. Additionally, a non-auto-regressive
generation paradigm is proposed for reconstructing non-sequential reports.
Experimental results on five downstream tasks, including supervised
classification, zero-shot classification, image-to-text retrieval, semantic
segmentation, and object detection, show the proposed method outperforms other
state-of-the-art methods across multiple datasets and under different dataset
size settings. The code is available at https://github.com/QtacierP/PRIOR.
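
The global image-report alignment component is a standard symmetric contrastive objective; a minimal sketch (temperature and projection dimensions are hypothetical, and the local alignment, prototype memory bank, and conditional reconstruction modules are omitted):

    import torch
    import torch.nn.functional as F

    def global_alignment_loss(img_emb, txt_emb, temperature=0.07):
        # img_emb, txt_emb: [B, D] global embeddings of paired images and reports
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature
        targets = torch.arange(len(logits), device=logits.device)
        # symmetric InfoNCE: match each image to its report and vice versa
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2
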
AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder
Learning high-quality video representations has significant applications in
computer vision and remains challenging. Previous work based on masked
autoencoders, such as ImageMAE and VideoMAE, has proven the effectiveness of
learning representations of images and videos through a reconstruction strategy
in the visual modality. However, these models exhibit inherent limitations,
particularly in scenarios where extracting features solely from the visual
modality proves challenging, such as when dealing with low-resolution and
blurry original videos. Motivated by this, we propose AV-MaskEnhancer for learning
high-quality video representation by combining visual and audio information.
Our approach addresses the challenge by demonstrating the complementary nature
of audio and video features in cross-modality content. Moreover, our results on
the video classification task on the UCF101 dataset outperform existing work
and reach the state of the art, with a top-1 accuracy of 98.8% and a top-5
accuracy of 99.9%.
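
A toy sketch of the general audio-visual masked-autoencoder idea (not the actual AV-MaskEnhancer architecture; all names, sizes, and the masking scheme are assumptions): mask most video tokens, encode the visible video tokens jointly with the audio tokens, and reconstruct only the masked positions.

    import torch
    import torch.nn as nn

    class AVMaskedAutoencoder(nn.Module):
        def __init__(self, dim=256, heads=8, mask_ratio=0.75):
            super().__init__()
            self.mask_ratio = mask_ratio
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, dim)  # predicts the original token embedding

        def forward(self, video_tokens, audio_tokens):
            # video_tokens: [B, N, D], audio_tokens: [B, M, D]
            B, N, D = video_tokens.shape
            n_mask = int(N * self.mask_ratio)
            idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)
            mask = torch.zeros(B, N, dtype=torch.bool, device=video_tokens.device)
            mask.scatter_(1, idx[:, :n_mask], True)  # random masked positions
            x = torch.where(mask.unsqueeze(-1),
                            self.mask_token.expand(B, N, D), video_tokens)
            # audio tokens provide complementary context during encoding
            z = self.encoder(torch.cat([x, audio_tokens], dim=1))
            pred = self.head(z[:, :N])
            # reconstruction loss on masked video positions only
            return ((pred - video_tokens) ** 2)[mask].mean()
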
Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR
Due to the modality discrepancy between textual and acoustic modeling,
efficiently transferring linguistic knowledge from a pretrained language model
(PLM) to acoustic encoding for automatic speech recognition (ASR) still remains
a challenging task. In this study, we propose a cross-modality knowledge
transfer (CMKT) learning framework for a connectionist temporal classification
(CTC)-based ASR system, in which hierarchical alignments between the acoustic
and linguistic representations are applied. Additionally, we propose the use of
Sinkhorn attention in the cross-modality alignment process, of which standard
transformer attention is a special case. CMKT learning compels the acoustic
encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset,
with CTC greedy
decoding for inference (without using any language model), we achieved
state-of-the-art performance with 3.64% and 3.94% character error rates (CERs)
for the development and test sets, corresponding to relative improvements of
34.18% and 34.88% over the baseline CTC-ASR system, respectively.
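
Sinkhorn attention can be sketched as iterative row/column normalization of the attention matrix in log space, so that it approaches a doubly stochastic matrix; with zero Sinkhorn iterations it reduces to standard softmax attention (a minimal sketch, assuming single-head attention without masking; the paper's exact formulation may differ):

    import torch

    def sinkhorn_attention(q, k, v, n_iters=3):
        # q, k, v: [B, N, D]; standard softmax attention is the n_iters == 0 case
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # [B, N, N]
        log_p = torch.log_softmax(scores, dim=-1)                # row normalization
        for _ in range(n_iters):
            # alternate column and row normalization (Sinkhorn iterations)
            log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)
            log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)
        return log_p.exp() @ v
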
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
This paper considers contrastive training for cross-modal 0-shot transfer
wherein a pre-trained model in one modality is used for representation learning
in another domain using pairwise data. The learnt models in the latter domain
can then be used for a diverse set of tasks in a zero-shot way, similar to
Contrastive Language-Image Pre-training (CLIP) and Locked-image Tuning (LiT),
which have recently gained considerable attention. Most existing works
for cross-modal representation alignment (including CLIP and LiT) use the
standard contrastive training objective, which employs sets of positive and
negative examples to align similar and repel dissimilar training data samples.
However, similarity amongst training examples has a more continuous nature,
thus calling for a more 'non-binary' treatment. To address this, we propose a
novel loss function called Continuously Weighted Contrastive Loss (CWCL) that
employs a continuous measure of similarity. With CWCL, we seek to align the
embedding space of one modality with another. Owing to the continuous nature of
similarity in the proposed loss function, these models outperform existing
methods for 0-shot transfer across multiple models, datasets and modalities.
Particularly, we consider the modality pairs of image-text and speech-text and
our models achieve 5-8% (absolute) improvement over previous state-of-the-art
methods in 0-shot image classification and 20-30% (absolute) improvement in
0-shot speech-to-intent classification and keyword classification.
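
The key change relative to a standard contrastive loss can be sketched as replacing the one-hot targets with continuous per-pair weights; here the weights are assumed to come from intra-modal similarity in the frozen encoder's embedding space (a sketch under that assumption; the paper's exact weighting may differ):

    import torch
    import torch.nn.functional as F

    def cwcl_loss(frozen_emb, new_emb, temperature=0.07):
        # frozen_emb: [B, D] from the pretrained, locked encoder (e.g. text)
        # new_emb:    [B, D] from the encoder being trained (e.g. speech)
        f = F.normalize(frozen_emb, dim=-1)
        g = F.normalize(new_emb, dim=-1)
        # continuous targets: similarity of each frozen example to every other,
        # mapped to [0, 1] and normalized per row (replaces the identity matrix)
        w = (f @ f.t() + 1) / 2
        w = w / w.sum(dim=1, keepdim=True)
        logits = g @ f.t() / temperature
        # weighted cross-entropy over all pairs instead of positives only
        return -(w * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
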