HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs
The hubness problem widely exists in high-dimensional embedding space and is
a fundamental source of error for cross-modal matching tasks. In this work, we
study the emergence of hubs in Visual Semantic Embeddings (VSE) with
application to text-image matching. We analyze the pros and cons of two widely
adopted optimization objectives for training VSE and propose a novel
hubness-aware loss function (HAL) that addresses previous methods' defects.
Unlike (Faghri et al. 2018), which simply takes the hardest sample within a
mini-batch, HAL takes all samples into account, using both local and global
statistics to scale up the weights of "hubs". We evaluate our method with
various configurations of model architectures and datasets. The method exhibits
exceptionally good robustness and brings consistent improvement on the task of
text-image matching across all settings. Specifically, under the same model
architectures as (Faghri et al. 2018) and (Lee et al. 2018), by switching only
the learning objective, we report a maximum R@1 improvement of 7.4% on MS-COCO
and 8.3% on Flickr30k. Comment: AAAI-20 (to appear).
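The weighting idea in the abstract — penalizing all negatives in a mini-batch, with larger weights for hub-like samples that violate many margins — can be sketched as follows. This is a simplified NumPy illustration of soft negative weighting over a similarity matrix, not the paper's exact HAL formulation; the `margin` and `temperature` values are illustrative assumptions.

```python
import numpy as np

def weighted_triplet_loss(sim, margin=0.2, temperature=10.0):
    """Contrastive loss over a similarity matrix sim[i, j] between
    image i and caption j, with matched pairs on the diagonal.

    Instead of keeping only the hardest negative per row/column
    (as in VSE++ of Faghri et al. 2018), every negative contributes,
    weighted by a softmax over its margin violation. "Hubs" -- points
    that appear as near neighbours of many queries -- violate many
    margins and therefore accumulate large weights.  Sketch only,
    not the paper's exact HAL objective.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                       # similarities of matched pairs
    # Margin violations for image->text and text->image retrieval.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    mask = 1.0 - np.eye(n)                   # exclude the positive pair itself
    cost_i2t *= mask
    cost_t2i *= mask
    # Soft weighting: harder (more hub-like) negatives get larger weights.
    w_i2t = np.exp(temperature * cost_i2t) * mask
    w_i2t /= w_i2t.sum(axis=1, keepdims=True) + 1e-12
    w_t2i = np.exp(temperature * cost_t2i) * mask
    w_t2i /= w_t2i.sum(axis=0, keepdims=True) + 1e-12
    return float((w_i2t * cost_i2t).sum() + (w_t2i * cost_t2i).sum())
```

When all positives are well separated from all negatives by the margin, the loss is zero; as any negative drifts toward a positive's similarity, its softmax weight grows, which is the intended hub-suppressing behaviour.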
Multi-modal learning using deep neural networks
Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. The recent success of deep neural networks has enabled us to develop algorithms which give machines the ability to understand and interpret this information. Convolutional Neural Networks (CNNs) have become a standard in extracting rich features from visual stimuli. Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) units, have been highly successful in encoding and decoding sequential information like speech and text. Although these networks are highly successful in narrow applications, there is a need both to broaden their applicability and to develop methods which correlate visual information with semantic content.
This master’s thesis develops a common vector space between images and text. This vector space maps similar concepts, such as pictures of dogs and the word “puppy”, close together, while mapping disparate concepts far apart. Most cross-modal problems are solved using deep neural networks trained for specific tasks. This research formulates a unified model using a CNN and an RNN which projects images and text into a common embedding space and also decodes the image and text embeddings into meaningful sentences. This model shows diverse applications in cross-modal retrieval, image captioning, and sentence paraphrasing, and shows promising directions for neural networks to generalize well across different tasks.
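The joint-embedding idea described above can be sketched as follows. This is a minimal NumPy illustration in which random linear projections stand in for the trained CNN and RNN encoders; the matrices `W_img` and `W_txt`, the feature dimensions, and the `retrieve` helper are assumptions for illustration, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trained encoders: linear projections into a shared
# 64-d embedding space.  A real system would use a CNN for images
# and an RNN/LSTM for text, trained jointly so that matched
# image-caption pairs land close together.
W_img = rng.normal(size=(64, 2048))   # image features -> shared space
W_txt = rng.normal(size=(64, 300))    # text features  -> shared space

def embed(features, W):
    """Project features into the shared space and L2-normalize,
    so that dot products act as cosine similarities."""
    z = W @ features
    return z / np.linalg.norm(z)

def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate closest to the query
    in the shared embedding space (cross-modal retrieval)."""
    sims = candidate_embs @ query_emb
    return int(np.argmax(sims))

# Toy usage: match one image against three candidate captions.
img = embed(rng.normal(size=2048), W_img)
caps = np.stack([embed(rng.normal(size=300), W_txt) for _ in range(3)])
best = retrieve(img, caps)            # index of the best-matching caption
```

Because both modalities are normalized into the same space, the same `retrieve` call works in either direction (image-to-text or text-to-image), which is what makes a single shared space attractive for the cross-modal applications the thesis lists.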
Multi-Modal Medical Imaging Analysis with Modern Neural Networks
Medical imaging is an important non-invasive tool for diagnostic and treatment purposes in medical practice. However, interpreting medical images is a time-consuming and challenging task. Computer-aided diagnosis (CAD) tools have been used in clinical practice to assist medical practitioners in medical imaging analysis since the 1990s. Most of the current generation of CADs are built on conventional computer vision techniques, such as manually defined feature descriptors. Deep convolutional neural networks (CNNs) provide robust end-to-end methods that can automatically learn feature representations, making them a promising building block for next-generation CADs. However, applying CNNs to medical imaging analysis tasks is challenging. This dissertation addresses three major issues that hinder the use of modern deep neural networks in medical image analysis: lack of domain knowledge in architecture design, lack of labeled data in model training, and lack of uncertainty estimation in deep neural networks. We evaluated the proposed methods on six large, clinically relevant datasets. The results show that the proposed methods can significantly improve deep neural network performance on medical imaging analysis tasks.