
    Extending Multi-modal Contrastive Representations

    Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, the high dependence on large-scale, high-quality paired data and the expensive training costs limit their further development. Inspired by the recent C-MCR, this paper proposes Extending Multi-modal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method that flexibly learns a unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces. Specifically, Ex-MCR aligns multiple existing MCRs into the same base MCR, which effectively preserves the original semantic alignment of the base MCR. Besides, we comprehensively enhance the entire learning pipeline for aligning MCR spaces from the perspectives of training data, architecture, and learning objectives. With the original modality alignment preserved and the space alignment enhanced, Ex-MCR shows superior representation learning performance and excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR, we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into CLIP (vision-text), leveraging the overlapping text and image modalities, respectively. Remarkably, without using any paired data, Ex-MCR learns a unified 3D-image-text-audio contrastive representation and achieves state-of-the-art performance on audio-visual, 3D-image, audio-text, and visual-text retrieval and on 3D object classification. More importantly, extensive qualitative results further demonstrate emergent semantic alignment between the extended modalities (e.g., audio and 3D), which highlights the great potential of modality extensibility.
    Comment: Our code is available at https://github.com/MCR-PEFT/Ex-MC
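    As a rough illustration of the core idea, here is a minimal PyTorch sketch (not the authors' released code) of aligning one MCR space into a base MCR space through the overlapping text modality: a small projector is trained to map frozen CLAP-style text embeddings onto frozen CLIP-style text embeddings of the same texts, so no cross-modal paired data is needed. The encoders, the `text_loader`, and the embedding dimensions are assumptions.

```python
# Minimal sketch of MCR-space alignment via the shared text modality.
# ext_text_encoder / base_text_encoder / text_loader are hypothetical
# stand-ins for frozen CLAP/CLIP text encoders and a text-only dataloader.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Maps embeddings from the extended MCR space into the base MCR space."""
    def __init__(self, dim_in=512, dim_out=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.GELU(), nn.Linear(hidden, dim_out)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def alignment_loss(z_proj, z_base, tau=0.05):
    """Symmetric InfoNCE pulling matching text embeddings together."""
    logits = z_proj @ z_base.t() / tau
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

projector = Projector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
for texts in text_loader:                       # text-only corpus
    with torch.no_grad():
        z_ext = ext_text_encoder(texts)         # frozen extended-space encoder
        z_base = F.normalize(base_text_encoder(texts), dim=-1)  # frozen base
    loss = alignment_loss(projector(z_ext), z_base)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# At inference, embeddings from the extended space's other modality
# (e.g. CLAP audio) passed through `projector` land in the base space
# alongside its original image and text embeddings.
```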

    Data-efficient deep representation learning

    Current deep learning methods succeed in many data-intensive applications, but they are still not able to produce robust performance when training samples are scarce. To investigate how to improve the performance of deep learning paradigms when training samples are limited, data-efficient deep representation learning (DDRL) is proposed in this study. DDRL, as a sub-area of representation learning, mainly addresses the following problem: how can the performance of a deep learning method be maintained when the number of training samples is significantly reduced? This is vital for many applications where collecting data is highly costly, such as medical image analysis. Incorporating a certain kind of prior knowledge into the learning paradigm is key to achieving data efficiency. The learning process of a deep learning method can be divided into three parts (locations), namely Data, Optimisation and Model. Integrating prior knowledge into these three locations is expected to bring data efficiency into a learning paradigm, which can dramatically increase model performance under the condition of limited training data. In this thesis, we aim to develop novel deep learning methods for achieving data-efficient training, each of which integrates a certain kind of prior knowledge into one of these three locations. We make the following contributions. First, we propose an iterative solution based on deep learning for medical image segmentation tasks, where dynamical systems are integrated into the segmentation labels in order to improve both performance and data efficiency. The proposed method not only shows superior performance and better data efficiency compared to the state-of-the-art methods, but also has better interpretability and rotational invariance, which are desirable for medical imaging applications. Second, we propose a novel training framework which adaptively selects more informative samples for training during the optimisation process. The adaptive selection, or sampling, is performed based on a hardness-aware strategy in the latent space constructed by a generative model. We show that the proposed framework outperforms random sampling, which demonstrates its effectiveness. Third, we propose a deep neural network model which produces segmentation maps in a coarse-to-fine manner. The proposed architecture is a sequence of computational blocks, each containing a number of convolutional layers, in which each block provides its successive block with a coarser segmentation map as a reference. Such a mechanism enables us to train the network with limited training samples and produce more interpretable results.
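    To make the second contribution concrete, the sketch below illustrates one plausible form of hardness-aware sample selection in PyTorch. It is an assumption-laden approximation, not the thesis implementation: where the thesis measures hardness in the latent space of a generative model, this sketch substitutes the current per-sample loss as a crude proxy.

```python
# Hypothetical sketch of hardness-aware batch selection: per-sample loss
# stands in for the latent-space hardness measure used in the thesis.
import torch
import torch.nn.functional as F

@torch.no_grad()
def hardness_scores(model, inputs, targets):
    """Per-sample loss as a cheap proxy for sample hardness."""
    logits = model(inputs)
    return F.cross_entropy(logits, targets, reduction="none")

def select_hard_batch(model, pool_x, pool_y, batch_size, temperature=1.0):
    """Draw a batch from the candidate pool, favouring harder samples."""
    scores = hardness_scores(model, pool_x, pool_y)
    probs = torch.softmax(scores / temperature, dim=0)
    idx = torch.multinomial(probs, batch_size, replacement=False)
    return pool_x[idx], pool_y[idx]

# Assumed usage inside an ordinary training loop: replace uniform
# minibatch sampling with the hardness-weighted draw.
# x, y = select_hard_batch(model, pool_x, pool_y, batch_size=32)
# loss = F.cross_entropy(model(x), y)
```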

    Prompt-based Distribution Alignment for Unsupervised Domain Adaptation

    Despite the recent unprecedented success of large pre-trained vision-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge in directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of the source and target domains, since the performance of UDA strongly depends on a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely a base branch and an alignment branch. The base branch focuses on integrating class-related representations into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, in the alignment branch we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT), which makes the input attend to the feature banks, effectively integrating self-enhanced and cross-domain features into the model. In this way, the two branches mutually reinforce each other to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks and demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.
    Comment: 13 pages, 6 figures
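    As a hypothetical rendering of the alignment branch, the PyTorch sketch below models image-guided feature tuning (IFT) as cross-attention from an image feature to a domain feature bank; `image_features`, `source_bank`, and `target_bank` are assumed precomputed tensors, and the residual fusion is an assumption of this sketch rather than the released PDA code.

```python
# Hypothetical sketch of image-guided feature tuning (IFT): the input image
# feature attends to a bank of domain features and is fused residually.
import torch
import torch.nn as nn

class ImageGuidedFeatureTuning(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feat, bank):
        """img_feat: (B, D) queries; bank: (N, D) stored domain features."""
        q = img_feat.unsqueeze(1)                                # (B, 1, D)
        kv = bank.unsqueeze(0).expand(img_feat.size(0), -1, -1)  # (B, N, D)
        tuned, _ = self.attn(q, kv, kv)
        return img_feat + tuned.squeeze(1)                       # residual fusion

# Assumed usage: tune against both domains' banks, then score the fused
# feature against the prompt-derived text classifiers.
ift = ImageGuidedFeatureTuning(dim=512)
f_src = ift(image_features, source_bank)    # attend to source-domain bank
f_tgt = ift(image_features, target_bank)    # attend to target-domain bank
aligned_feature = 0.5 * (f_src + f_tgt)
```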