Extending Multi-modal Contrastive Representations
Multi-modal contrastive representation (MCR) of more than three modalities is
critical in multi-modal learning. Although recent methods showcase impressive
achievements, the high dependence on large-scale, high-quality paired data and
the expensive training costs limit their further development. Inspired by the
recent C-MCR, this paper proposes Extending Multimodal Contrastive
Representation (Ex-MCR), a training-efficient and paired-data-free method that
flexibly learns a unified contrastive representation space for more than three
modalities by integrating the knowledge of existing MCR spaces. Specifically,
Ex-MCR aligns multiple existing MCRs into the same base MCR, which effectively
preserves the original semantic alignment of the base MCR. In addition,
we comprehensively enhance the entire learning pipeline for aligning MCR spaces
from the perspectives of training data, architecture, and learning objectives.
With the preserved original modality alignment and the enhanced space
alignment, Ex-MCR shows superior representation learning performance and
excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR,
we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP
space (vision-text), leveraging the overlapping text and image modalities,
respectively. Remarkably, without using any paired data, Ex-MCR learns a
3D-image-text-audio unified contrastive representation, and it achieves
state-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text
retrieval, and 3D object classification tasks. More importantly, extensive
qualitative results further demonstrate the emergent semantic alignment between
the extended modalities (e.g., audio and 3D), which highlights the great
potential of modality extensibility.
Comment: Our code is available at https://github.com/MCR-PEFT/Ex-MC
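The abstract only sketches the alignment mechanism, so the following is a minimal, illustrative PyTorch sketch of the core idea: mapping a leaf MCR space (e.g., CLAP) into a base MCR space (e.g., CLIP) through their overlapping text modality, so that the leaf space's other modality (audio) lands in the base space without paired data. All names, dimensions, and the MLP design are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceProjector(nn.Module):
    """Illustrative projector from a leaf MCR space (e.g., CLAP, 512-d)
    into a base MCR space (e.g., CLIP, 768-d). Dimensions are assumed."""
    def __init__(self, src_dim=512, dst_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dst_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # keep unit norm

def alignment_loss(proj, leaf_text_emb, base_text_emb):
    """Align the two spaces through their shared text modality: the
    projected leaf-space embedding of a sentence should match the
    base-space embedding of the same sentence (cosine loss)."""
    mapped = proj(leaf_text_emb)
    return 1.0 - (mapped * base_text_emb).sum(dim=-1).mean()

# Toy usage: once trained on text alone, the same projector can map
# CLAP *audio* embeddings into CLIP space alongside images and text.
proj = SpaceProjector()
leaf_text = F.normalize(torch.randn(8, 512), dim=-1)
base_text = F.normalize(torch.randn(8, 768), dim=-1)
loss = alignment_loss(proj, leaf_text, base_text)
loss.backward()
```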
Data-efficient deep representation learning
Current deep learning methods succeed in many data-intensive applications, but they still struggle to deliver robust performance when training samples are scarce. To investigate how to improve the performance of deep learning paradigms when training samples are limited, data-efficient deep representation learning (DDRL) is proposed in this study. DDRL, as a sub-area of representation learning, mainly addresses the following problem: how can the performance of a deep learning method be maintained when the number of training samples is significantly reduced? This is vital for many applications where collecting data is highly costly, such as medical image analysis. Incorporating a certain kind of prior knowledge into the learning paradigm is key to achieving data efficiency.
Deep learning, as a sub-area of machine learning, can be divided into three parts (locations) in its learning process, namely Data, Optimisation and Model. Integrating prior knowledge into these three locations is expected to bring data efficiency into a learning paradigm, and can dramatically improve model performance when training data is limited.
In this thesis, we aim to develop novel deep learning methods for achieving data-efficient training, each of which integrates a certain kind of prior knowledge into one of these three locations. We make the following contributions. First, we propose an iterative deep-learning-based solution for medical image segmentation tasks, where dynamical systems are integrated into the segmentation labels in order to improve both performance and data efficiency. The proposed method not only shows superior performance and better data efficiency than state-of-the-art methods, but also offers better interpretability and rotational invariance, which are desirable for medical imaging applications. Second, we propose a novel training framework which adaptively selects more informative samples during the optimisation process. The adaptive selection, or sampling, is performed with a hardness-aware strategy in the latent space constructed by a generative model.
We show that the proposed framework outperforms a random sampling baseline, demonstrating its effectiveness. Third, we propose a deep neural network model which produces segmentation maps in a coarse-to-fine manner; a sketch of this pattern follows below. The proposed architecture is a sequence of computational blocks, each containing a number of convolutional layers, in which every block provides its successor with a coarser segmentation map as a reference. Such a mechanism enables us to train the network with limited training samples and to produce more interpretable results.
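The thesis itself is not reproduced in this listing, so the following is a minimal PyTorch sketch of the coarse-to-fine pattern described above: a chain of convolutional blocks in which each block refines the segmentation map handed over by its predecessor. The block structure, channel counts, and class numbers are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """One computational block: refines the predecessor's coarse
    segmentation map, using the input image as context."""
    def __init__(self, in_ch, num_classes, width=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + num_classes, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, num_classes, 3, padding=1),
        )

    def forward(self, image, coarse_map):
        # Each block sees the raw image plus its predecessor's map.
        return self.conv(torch.cat([image, coarse_map], dim=1))

class CoarseToFineNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2, num_blocks=3):
        super().__init__()
        self.num_classes = num_classes
        self.blocks = nn.ModuleList(
            RefineBlock(in_ch, num_classes) for _ in range(num_blocks)
        )

    def forward(self, image):
        b, _, h, w = image.shape
        seg = torch.zeros(b, self.num_classes, h, w, device=image.device)
        maps = []
        for block in self.blocks:
            seg = block(image, seg)  # successively finer predictions
            maps.append(seg)
        return maps  # every stage can be supervised; the last is the output

# Toy usage on a 64x64 single-channel image batch.
outs = CoarseToFineNet()(torch.randn(2, 1, 64, 64))
print([o.shape for o in outs])
```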
Prompt-based Distribution Alignment for Unsupervised Domain Adaptation
Despite the recent unprecedented success of large pre-trained visual-language
models (VLMs) on a wide range of downstream tasks, the real-world unsupervised
domain adaptation (UDA) problem remains under-explored. Therefore, in this
paper, we first experimentally demonstrate that unsupervised-trained VLMs can
significantly reduce the distribution discrepancy between the source and target
domains, thereby improving the performance of UDA. However, a major challenge
in directly deploying such models on downstream UDA tasks is prompt
engineering, which requires aligning the domain knowledge of the source and
target domains, since UDA performance depends heavily on a good
domain-invariant representation. We further
propose a Prompt-based Distribution Alignment (PDA) method to incorporate the
domain knowledge into prompt learning. Specifically, PDA employs a two-branch
prompt-tuning paradigm, namely a base branch and an alignment branch. The base
branch focuses on integrating class-related representation into prompts,
ensuring discrimination among different classes. To further minimize domain
discrepancy, for the alignment branch, we construct feature banks for both the
source and target domains and propose image-guided feature tuning (IFT) to make
the input attend to the feature banks, which effectively integrates self-enhanced
and cross-domain features into the model. In this way, these two branches can
be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct
extensive experiments on three benchmarks to demonstrate that our proposed PDA
achieves state-of-the-art performance. The code is available at
https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.
Comment: 13 pages, 6 figures
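The abstract describes image-guided feature tuning (IFT) only at a high level, so the following is a minimal, illustrative PyTorch sketch (not the authors' released code) of the idea: an input image feature attends over concatenated source- and target-domain feature banks, mixing self-enhanced and cross-domain information. The attention layout, shapes, and residual design are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGuidedFeatureTuning(nn.Module):
    """Illustrative IFT module: an image feature queries source- and
    target-domain feature banks via cross-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feat, source_bank, target_bank):
        # img_feat: (B, D); banks: (N, D) per-domain feature prototypes.
        bank = torch.cat([source_bank, target_bank], dim=0)       # (2N, D)
        kv = bank.unsqueeze(0).expand(img_feat.size(0), -1, -1)   # (B, 2N, D)
        q = img_feat.unsqueeze(1)                                 # (B, 1, D)
        tuned, _ = self.attn(q, kv, kv)
        # Residual mix of the original and bank-attended feature.
        return F.normalize(img_feat + tuned.squeeze(1), dim=-1)

# Toy usage: the tuned feature would then be scored against the
# prompt-derived text features from the base branch.
ift = ImageGuidedFeatureTuning()
img = torch.randn(4, 512)
src_bank, tgt_bank = torch.randn(10, 512), torch.randn(10, 512)
print(ift(img, src_bank, tgt_bank).shape)  # torch.Size([4, 512])
```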