
    Domain Robustness in Multi-modality Learning and Visual Question Answering

    Humans perceive the world via multiple modalities, as information from a single modality is usually partial and incomplete. This observation motivates the development of machine learning algorithms capable of handling multi-modal data and performing intelligent reasoning. The recent resurgence of deep learning brings both opportunities and challenges to multi-modal reasoning. On the one hand, its strong representation learning capability provides a unified approach to representing information across multiple modalities. On the other hand, properly training such models typically requires enormous amounts of data, which is not always feasible, especially in the multi-modal setting. One promising direction for mitigating the lack of data for deep learning models is to transfer knowledge (e.g., gained from solving related problems) to low-resource domains. This procedure is known as transfer learning or domain adaptation, and it has demonstrated great success in various visual and linguistic applications. However, how to effectively transfer knowledge in a multi-modality setting remains an open research question. In this thesis, we choose multi-modal reasoning as our target task and aim to improve the performance of deep neural networks on low-resource domains via domain adaptation. We first briefly discuss our prior work on advertisement understanding (as a typical multi-modal reasoning problem) and share our experience in addressing the data-availability challenge. Next, we turn to visual question answering (VQA), a more general problem that involves more complicated reasoning. We evaluate mainstream VQA models and classic single-modal domain adaptation strategies and show that existing methods usually suffer significant performance degradation when directly applied to the multi-modal setting. We measure the domain gaps in different modalities and design an effective strategy to manually control domain shifts on individual modalities, which helps better understand the problem. Lastly, we present a systematic study across real datasets to answer a few fundamental questions regarding knowledge transfer in VQA, such as the sensitivity of various models to different types of supervision (i.e., unsupervised, self-supervised, semi-supervised, and fully supervised). We conclude by sharing the limitations and our vision for future research directions.
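    As one illustration of how a per-modality domain gap might be quantified (the thesis does not specify this particular metric), the sketch below compares feature distributions from two domains with a maximum mean discrepancy (MMD) estimate; the feature extractors and tensors are placeholder assumptions, not the thesis's actual measurement protocol.

    ```python
    # Hypothetical sketch: estimate a per-modality domain gap with an RBF-kernel MMD.
    # source_feats / target_feats are assumed (N, D) tensors of visual or question
    # features produced by the same frozen encoder on the source and target domains.
    import torch

    def rbf_mmd(source_feats: torch.Tensor, target_feats: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
        """Biased MMD^2 estimate between two feature matrices."""
        def kernel(a, b):
            # Pairwise squared Euclidean distances mapped through an RBF kernel.
            d2 = torch.cdist(a, b) ** 2
            return torch.exp(-d2 / (2 * sigma ** 2))

        k_ss = kernel(source_feats, source_feats).mean()
        k_tt = kernel(target_feats, target_feats).mean()
        k_st = kernel(source_feats, target_feats).mean()
        return k_ss + k_tt - 2 * k_st

    # Example with random placeholder features: compare the gap in the visual
    # modality against the gap in the question (text) modality.
    vis_gap = rbf_mmd(torch.randn(256, 2048), torch.randn(256, 2048) + 0.5)
    txt_gap = rbf_mmd(torch.randn(256, 768), torch.randn(256, 768) + 0.1)
    print(f"visual-modality gap: {vis_gap:.4f}, question-modality gap: {txt_gap:.4f}")
    ```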

    Automated Multi-sequence Cardiac MRI Segmentation Using Supervised Domain Adaptation

    Left ventricle segmentation and morphological assessment are essential for improving the diagnosis and our understanding of cardiomyopathy, which in turn is imperative for reducing the risk of myocardial infarction in patients. Convolutional neural network (CNN) based methods for cardiac magnetic resonance (CMR) image segmentation rely on supervision with pixel-level annotations and may not generalize well to images from a different domain. These methods are typically sensitive to variations in imaging protocols and data acquisition. Since annotating multi-sequence CMR images is tedious and subject to inter- and intra-observer variation, developing methods that can automatically adapt from one domain to the target domain is of great interest. In this paper, we propose an approach for domain adaptation in the multi-sequence CMR segmentation task using transfer learning that combines multi-source image information. We first train an encoder-decoder CNN on T2-weighted and balanced Steady-State Free Precession (bSSFP) MR images with pixel-level annotations, and then fine-tune the same network with a limited number of Late Gadolinium Enhanced MR (LGE-MR) subjects to adapt the domain features. The domain-adapted network was trained with just four LGE-MR training samples and obtained an average Dice score of ~85.0% on a test set comprising 40 LGE-MR subjects. The proposed method significantly outperformed a network without adaptation trained from scratch on the same set of LGE-MR training data.
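    A minimal sketch of the pretrain-then-fine-tune recipe described above, assuming a generic PyTorch encoder-decoder (a hypothetical `UNet` class) and placeholder data loaders; it is not the authors' actual training code, and the class count, epochs, and learning rates are illustrative assumptions.

    ```python
    # Illustrative only: supervised domain adaptation by fine-tuning a segmentation
    # network pretrained on source sequences (T2/bSSFP) on a few target (LGE) scans.
    # UNet, source_loader, and lge_loader_4_subjects are assumed placeholders.
    import torch
    import torch.nn as nn

    def train(model, loader, epochs, lr):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()              # pixel-wise class labels
        model.train()
        for _ in range(epochs):
            for images, masks in loader:             # images: (B,1,H,W), masks: (B,H,W)
                opt.zero_grad()
                loss = loss_fn(model(images), masks)
                loss.backward()
                opt.step()

    model = UNet(in_channels=1, num_classes=4)       # background + cardiac structures (assumed)
    train(model, source_loader, epochs=50, lr=1e-3)            # 1) pretrain on annotated T2/bSSFP
    train(model, lge_loader_4_subjects, epochs=20, lr=1e-4)    # 2) fine-tune on 4 LGE-MR subjects
    ```

    The smaller learning rate in the second stage reflects the usual fine-tuning practice of nudging, rather than overwriting, the source-domain features when only a handful of target samples are available.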

    Zero-Shot Cross-Lingual Transfer with Meta Learning

    Learning what to share between tasks has recently been a topic of great importance, as strategic sharing of knowledge has been shown to improve downstream task performance. This is particularly important for multilingual applications, as most languages in the world are under-resourced. Here, we consider the setting of training models on multiple different languages at the same time, when little or no data is available for languages other than English. We show that this challenging setup can be approached using meta-learning, where, in addition to training a source language model, another model learns to select which training instances are the most beneficial to the first. We experiment with standard supervised, zero-shot cross-lingual, and few-shot cross-lingual settings for different natural language understanding tasks (natural language inference, question answering). Our extensive experimental setup demonstrates the consistent effectiveness of meta-learning across a total of 15 languages. We improve upon the state of the art for zero-shot and few-shot NLI (on MultiNLI and XNLI) and QA (on the MLQA dataset). A comprehensive error analysis indicates that the correlation of typological features between languages can partly explain when parameter sharing learned via meta-learning is beneficial. Comment: Accepted as a long paper at the EMNLP 2020 main conference.
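    To make the "a second model learns which training instances are most beneficial" idea concrete, here is a heavily simplified, hypothetical sketch of instance weighting with a one-step meta-gradient. It uses a toy linear classifier over pre-pooled sentence features; `train_batches` (source-language data) and `dev_batch` (target-language dev data) are assumed placeholders, and this is not the paper's actual meta-learning algorithm.

    ```python
    # Simplified sketch: a scorer network weights source-language training instances;
    # it is updated so that a one-step "virtual" update of the main model reduces
    # the loss on a small target-language dev batch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(768, 3)                                   # toy NLI head over pooled features
    scorer = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_model = torch.optim.SGD(model.parameters(), lr=1e-2)
    opt_scorer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

    for feats, labels in train_batches:                         # source-language batch
        # Per-instance weights that sum to 1 over the batch.
        weights = torch.softmax(scorer(feats).squeeze(-1), dim=0)
        losses = F.cross_entropy(model(feats), labels, reduction="none")
        weighted_loss = (weights * losses).sum()

        # One differentiable "virtual" SGD step of the main model.
        grads = torch.autograd.grad(weighted_loss, list(model.parameters()), create_graph=True)
        virtual_w = model.weight - 1e-2 * grads[0]
        virtual_b = model.bias - 1e-2 * grads[1]

        # Meta-objective: dev loss of the virtually updated model on the target language.
        dev_feats, dev_labels = dev_batch
        dev_loss = F.cross_entropy(dev_feats @ virtual_w.t() + virtual_b, dev_labels)

        opt_scorer.zero_grad()
        dev_loss.backward()                                     # meta-gradient flows into the scorer
        opt_scorer.step()

        # Real update of the main model with the (now fixed) instance weights.
        opt_model.zero_grad()
        real_loss = (weights.detach() * F.cross_entropy(model(feats), labels, reduction="none")).sum()
        real_loss.backward()
        opt_model.step()
    ```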

    Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection

    Multi-label image classification is a fundamental but challenging task towards general visual understanding. Existing methods have found that region-level cues (e.g., features from RoIs) can facilitate multi-label classification. Nevertheless, such methods usually require laborious object-level annotations (i.e., object labels and bounding boxes) for effective learning of object-level visual features. In this paper, we propose a novel and efficient deep framework that boosts multi-label classification by distilling knowledge from a weakly-supervised detection task without bounding box annotations. Specifically, given the image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module in which the WSD model guides the classification model via the class-level predictions for the whole image and the object-level visual features for object RoIs. The WSD model is the teacher model and the classification model is the student model. After this cross-task knowledge distillation, the performance of the classification model is significantly improved while efficiency is maintained, since the WSD model can be safely discarded in the test phase. Extensive experiments on two large-scale datasets (MS-COCO and NUS-WIDE) show that our framework outperforms state-of-the-art methods in both performance and efficiency. Comment: Accepted by ACM Multimedia 2018; 9 pages, 4 figures, 5 tables.
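    A minimal sketch of cross-task distillation of class-level predictions from a frozen WSD teacher into a multi-label classification student; the models, temperature, and loss weighting below are illustrative assumptions and not the paper's exact formulation (which also transfers object-level RoI features).

    ```python
    # Illustrative sketch: multi-label student guided by the soft class-level
    # predictions of a frozen weakly-supervised detection (WSD) teacher.
    # teacher, student, and loader are assumed placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    bce = nn.BCEWithLogitsLoss()
    T, alpha = 2.0, 0.5                      # temperature and distillation weight (assumed)
    opt = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)
    teacher.eval()

    for images, labels in loader:            # labels: multi-hot image-level annotations
        with torch.no_grad():
            t_logits = teacher(images)       # class-level predictions for the whole image
        s_logits = student(images)

        # Supervised multi-label loss + soft-target distillation on class probabilities.
        sup_loss = bce(s_logits, labels.float())
        distill_loss = F.binary_cross_entropy_with_logits(
            s_logits / T, torch.sigmoid(t_logits / T))
        loss = (1 - alpha) * sup_loss + alpha * distill_loss

        opt.zero_grad()
        loss.backward()
        opt.step()

    # At test time only the student is used; the WSD teacher can be discarded.
    ```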

    Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

    Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images, combined with powerful models and efficient algorithms, yields considerable performance on scene recognition, it still suffers from variation in ground objects, lighting conditions, etc. Inspired by the multi-channel perception theory in cognitive science, in this paper, to improve performance on aerial scene recognition, we explore a novel audiovisual aerial scene recognition task that uses both images and sounds as input. Based on the observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit knowledge from sound events to improve performance on aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting audio information for aerial scene recognition. The source code is publicly available for reproducibility purposes. Comment: ECCV 2020.
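    As one plausible instance of the multimodal framework described above, here is a minimal two-branch fusion sketch that combines image and audio features for scene classification; the backbones, feature sizes, and class count are assumptions for illustration, not the paper's architecture.

    ```python
    # Illustrative two-branch audiovisual model: separate image and audio encoders
    # whose features are concatenated and classified into aerial scene categories.
    import torch
    import torch.nn as nn

    class AudioVisualSceneNet(nn.Module):
        def __init__(self, num_scenes: int, img_dim: int = 512, aud_dim: int = 128):
            super().__init__()
            # Visual branch: small CNN over RGB aerial images -> img_dim features.
            self.visual = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, img_dim))
            # Audio branch: small CNN over log-mel spectrograms -> aud_dim features.
            self.audio = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, aud_dim))
            self.classifier = nn.Linear(img_dim + aud_dim, num_scenes)

        def forward(self, image, spectrogram):
            fused = torch.cat([self.visual(image), self.audio(spectrogram)], dim=1)
            return self.classifier(fused)

    model = AudioVisualSceneNet(num_scenes=10)   # placeholder class count
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 128, 300))
    ```

    In such a setup, sound-event knowledge can be injected, for example, by pretraining the audio branch on a sound-event classification task before joint training; the paper evaluates three transfer strategies of this general flavor.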