End-to-End Cross-Modality Retrieval with CCA Projections and Pairwise Ranking Loss
Cross-modality retrieval encompasses retrieval tasks where the fetched items
are of a different type than the search query, e.g., retrieving pictures
relevant to a given text query. The state-of-the-art approach to cross-modality
retrieval relies on learning a joint embedding space of the two modalities,
where items from either modality are retrieved using nearest-neighbor search.
In this work, we introduce a neural network layer based on Canonical
Correlation Analysis (CCA) that learns better embedding spaces by analytically
computing projections that maximize correlation. In contrast to previous
approaches, the CCA Layer (CCAL) allows us to combine existing objectives for
embedding space learning, such as pairwise ranking losses, with the optimal
projections of CCA. We show the effectiveness of our approach for
cross-modality retrieval on three different scenarios (text-to-image,
audio-sheet-music and zero-shot retrieval), surpassing both Deep CCA and a
multi-view network using freely learned projections optimized by a pairwise
ranking loss, especially when little training data is available (the code for
all three methods is released at: https://github.com/CPJKU/cca_layer).
Comment: Preliminary version of a paper published in the International Journal
of Multimedia Information Retrieval
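To make the terms "pairwise ranking loss" and "nearest-neighbor search" in the abstract above concrete, the following is a minimal sketch in plain Python. It is not the CCA Layer and is not taken from the authors' repository: the function names, the cosine similarity measure, the margin value, and the toy embeddings are all assumptions chosen for illustration, and the loss is computed in one direction only (query to item).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pairwise_ranking_loss(queries, items, margin=0.2):
    """Hinge-style ranking loss, assuming queries[i] matches items[i].

    Every mismatched item j != i is penalised when it is not at least
    `margin` less similar to query i than the matching item is.
    """
    total = 0.0
    for i, q in enumerate(queries):
        pos = cosine(q, items[i])
        for j, item in enumerate(items):
            if j != i:
                total += max(0.0, margin - pos + cosine(q, item))
    return total / len(queries)

def retrieve(query, items):
    """Nearest-neighbor retrieval: item indices sorted by decreasing similarity."""
    return sorted(range(len(items)),
                  key=lambda j: cosine(query, items[j]),
                  reverse=True)

# Hypothetical 2-D embeddings of two text queries and two images.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [[0.9, 0.1], [0.2, 0.8]]

loss = pairwise_ranking_loss(text_emb, image_emb)
ranking = retrieve(text_emb[0], image_emb)
```

In this toy case each query is already much closer to its own image than to the other one, so the loss is zero and the first text query retrieves image 0 first. Swapping the two images makes every pair violate the margin and the loss becomes positive.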
Recurrent Attention Models for Depth-Based Person Identification
We present an attention-based model that reasons on human body shape and
motion dynamics to identify individuals in the absence of RGB information,
hence in the dark. Our approach leverages unique 4D spatio-temporal signatures
to address the identification problem across days. Formulated as a
reinforcement learning task, our model is based on a combination of
convolutional and recurrent neural networks with the goal of identifying small,
discriminative regions indicative of human identity. We demonstrate that our
model produces state-of-the-art results on several published datasets given
only depth images. We further study the robustness of our model towards
viewpoint, appearance, and volumetric changes. Finally, we share insights
gleaned from interpretable 2D, 3D, and 4D visualizations of our model's
spatio-temporal attention.
Comment: Computer Vision and Pattern Recognition (CVPR) 201
Multi-Modal Deep Learning to Understand Vision and Language
Developing intelligent agents that can perceive and understand the rich visual world around us has been a long-standing goal in the field of artificial intelligence. In the last few years, significant progress has been made towards this goal, and recent advances in general visual and language understanding have been attributed to deep learning. Convolutional neural networks have been used to learn image representations, while recurrent neural networks have demonstrated the ability to generate text from visual stimuli. In this thesis, we develop methods and techniques using hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances.
To present these methods, this work is divided into two broad parts. In the first part, we introduce a general-purpose attention mechanism, modeled using a continuous function, for video understanding. The use of an attention-based hierarchical approach along with automatic boundary detection advances state-of-the-art video captioning results. We also develop techniques for summarizing and annotating long videos. In the second part, we introduce architectures along with training techniques to produce a common connection space where natural language sentences are efficiently and accurately connected with visual modalities. In this connection space, similar concepts lie close together, while dissimilar concepts lie far apart, irrespective of their modality. We discuss four modality transformations: visual to text, text to visual, visual to visual, and text to text. We introduce a novel attention mechanism to align multi-modal embeddings, which are learned through a multi-modal metric loss function. The common vector space is shown to enable bidirectional generation of images and text. The learned common vector space is evaluated on multiple image-text datasets for cross-modal retrieval and zero-shot retrieval. The models are shown to advance the state-of-the-art on tasks that require joint processing of images and natural language.
PSACNN: Pulse Sequence Adaptive Fast Whole Brain Segmentation
With the advent of convolutional neural networks (CNNs), supervised learning
methods are increasingly being used for whole brain segmentation. However, a
large, manually annotated training dataset of labeled brain images required to
train such supervised methods is frequently difficult to obtain or create. In
addition, existing training datasets are generally acquired with a homogeneous
magnetic resonance imaging (MRI) acquisition protocol. CNNs trained on such
datasets are unable to generalize on test data with different acquisition
protocols. Modern neuroimaging studies and clinical trials are necessarily
multi-center initiatives with a wide variety of acquisition protocols. Despite
stringent protocol harmonization practices, it is very difficult to standardize
the gamut of MRI imaging parameters across scanners, field strengths, receive
coils etc., that affect image contrast. In this paper we propose a CNN-based
segmentation algorithm that, in addition to being highly accurate and fast, is
also resilient to variation in the input acquisition. Our approach relies on
building approximate forward models of pulse sequences that produce a typical
test image. For a given pulse sequence, we use its forward model to generate
plausible, synthetic training examples that appear as if they were acquired in
a scanner with that pulse sequence. Sampling over a wide variety of pulse
sequences results in a wide variety of augmented training examples that help
build an image contrast invariant model. Our method trains a single CNN that
can segment input MRI images with acquisition parameters as disparate as
T1-weighted and T2-weighted contrasts with only T1-weighted training
data. The segmentations generated are highly accurate with state-of-the-art
results (overall Dice overlap), with a fast run time (45
seconds), and consistent across a wide range of acquisition protocols.
Comment: Typo in author name corrected. Greves -> Grev
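For readers unfamiliar with the "Dice overlap" metric cited in the abstract above, here is a minimal sketch of how it is commonly computed for one label of a segmentation. The function name, the flattened-list representation of the label maps, and the toy labels are assumptions for illustration; this is not the evaluation code of the PSACNN paper.

```python
def dice_overlap(pred, truth, label):
    """Dice overlap 2|A n B| / (|A| + |B|) for a single label.

    `pred` and `truth` are equal-length sequences of per-voxel labels
    (a flattened segmentation); A and B are the voxels carrying `label`.
    """
    pred_n = sum(1 for p in pred if p == label)
    truth_n = sum(1 for t in truth if t == label)
    both = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_n + truth_n == 0:
        # Label absent from both maps: treated here as perfect agreement.
        return 1.0
    return 2.0 * both / (pred_n + truth_n)

# Hypothetical six-voxel segmentation with labels 0, 1 and 2.
pred = [0, 1, 1, 2, 2, 2]
truth = [0, 1, 2, 2, 2, 2]

dice_label_1 = dice_overlap(pred, truth, 1)  # 2*1 / (2 + 1)
dice_label_2 = dice_overlap(pred, truth, 2)  # 2*3 / (3 + 4)
```

A value of 1 means the predicted and reference regions coincide exactly, and 0 means they do not overlap at all. An "overall" figure is often obtained by averaging over labels, although the abstract does not say how its overall value was aggregated.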