Deep Multimodal Semantic Embeddings for Speech and Images
In this paper, we present a model which takes as input a corpus of images
with relevant spoken captions and finds a correspondence between the two
modalities. We employ a pair of convolutional neural networks to model visual
objects and speech signals at the word level, and tie the networks together
with an embedding and alignment model which learns a joint semantic space over
both modalities. We evaluate our model using image search and annotation tasks
on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000
spoken captions using Amazon Mechanical Turk.
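
The architecture described above lends itself to a compact sketch: two modality-specific encoders project into a shared space and are trained so that matched image/caption pairs score higher than mismatched ones. The PyTorch code below is a minimal illustration under that assumption, not the paper's implementation; it pools each modality to a single embedding and uses an in-batch margin ranking loss, whereas the paper additionally aligns word-level speech segments to visual objects. All class names, layer sizes, and the margin value are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy CNN mapping an RGB image to a unit-norm embedding (illustrative)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):                        # x: (B, 3, H, W)
        h = self.conv(x).flatten(1)              # (B, 64)
        return F.normalize(self.fc(h), dim=-1)   # unit-norm embedding

class SpeechEncoder(nn.Module):
    """Toy 1-D CNN over spectrogram frames, pooled to an embedding (illustrative)."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                        # x: (B, n_mels, T)
        h = self.conv(x).flatten(1)              # (B, 128)
        return F.normalize(self.fc(h), dim=-1)

def ranking_loss(img_emb, spc_emb, margin=0.2):
    """Hinge ranking loss: a matched image/caption pair should score
    higher than any mismatched pair in the batch by at least `margin`."""
    scores = img_emb @ spc_emb.t()               # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)             # matched pairs on the diagonal
    cost_img = F.relu(margin + scores - pos)     # image ranked against wrong captions
    cost_spc = F.relu(margin + scores - pos.t()) # caption ranked against wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_img = cost_img.masked_fill(mask, 0)
    cost_spc = cost_spc.masked_fill(mask, 0)
    return (cost_img + cost_spc).mean()
```

Image search and annotation then reduce to ranking by the cosine score: embed the query in one modality and sort the candidates of the other modality by similarity.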
Learning Modality-Invariant Representations for Speech and Images
In this paper, we explore the unsupervised learning of a semantic embedding
space for co-occurring sensory inputs. Specifically, we focus on the task of
learning a semantic vector space for both spoken and handwritten digits using
the TIDIGITs and MNIST datasets. Current techniques encode image and
audio/textual inputs directly to semantic embeddings. In contrast, our
technique maps an input to the mean and log variance vectors of a diagonal
Gaussian from which sample semantic embeddings are drawn. In addition to
encouraging semantic similarity between co-occurring inputs, our loss function
includes a regularization term borrowed from variational autoencoders (VAEs)
which drives the posterior distributions over embeddings to be unit Gaussian.
We can use this regularization term to filter out modality information while
preserving semantic information. We speculate this technique may be more
broadly applicable to other areas of cross-modality/domain information
retrieval and transfer learning.
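
A minimal sketch can make the Gaussian-embedding loss concrete. Assuming simple MLP encoders, reparameterized sampling, an MSE term tying co-occurring embeddings together, and a KL weight `beta`, the PyTorch code below (an assumption-laden illustration, not the authors' implementation) shows how the VAE-style regularizer drives both posteriors toward a unit Gaussian while the similarity term preserves shared semantics. Encoder widths, the similarity measure, and `beta` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEmbedder(nn.Module):
    """Maps an input vector to the mean and log-variance of a diagonal
    Gaussian posterior over semantic embeddings (one encoder per modality)."""
    def __init__(self, in_dim, embed_dim=32, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, embed_dim)
        self.logvar = nn.Linear(hidden, embed_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

def kl_to_unit_gaussian(mu, logvar):
    """VAE-style KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch."""
    return 0.5 * torch.mean(
        torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar, dim=-1))

def paired_loss(z_img, z_audio, mu_i, lv_i, mu_a, lv_a, beta=1e-3):
    """Pull co-occurring embeddings together; the KL term pushes both
    posteriors toward a unit Gaussian, washing out modality identity."""
    semantic = F.mse_loss(z_img, z_audio)        # similarity of co-occurring inputs
    kl = kl_to_unit_gaussian(mu_i, lv_i) + kl_to_unit_gaussian(mu_a, lv_a)
    return semantic + beta * kl
```

In use, one `GaussianEmbedder` would be instantiated per modality (e.g. flattened MNIST pixels and a fixed-length spoken-digit feature vector), and the KL weight trades off how aggressively modality-specific information is filtered out relative to how tightly semantics are preserved.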
- …
