Learning weakly supervised multimodal phoneme embeddings
Recent works have explored deep architectures for learning multimodal speech
representation (e.g. audio and images, articulation and audio) in a supervised
way. Here we investigate the role of combining different speech modalities,
i.e. audio and visual information representing lip movements, in a weakly
supervised way using Siamese networks and lexical same-different side
information. In particular, we ask whether one modality can benefit from the
other to provide a richer representation for phone recognition in a weakly
supervised setting. We introduce mono-task and multi-task methods for merging
speech and visual modalities for phone recognition. Mono-task learning
consists of applying a Siamese network to the concatenation of the two
modalities, while multi-task learning receives several different
combinations of modalities at training time. We show that multi-task learning
enhances discriminability for visual and multimodal inputs while minimally
impacting auditory inputs. Furthermore, we present a qualitative analysis of
the obtained phone embeddings, and show that cross-modal visual input can
improve the discriminability of phonological features that are visually
discernible (rounding, open/close, labial place of articulation), resulting in
representations that are closer to abstract linguistic features than those
based on audio alone.
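As a rough illustration of the training setup described above (a sketch under assumptions, not the authors' code), the snippet below pairs a shared PyTorch encoder with a contrastive same-different loss on concatenated audio and lip features; the feature dimensions, architecture, and margin are all invented for the example.

import torch
import torch.nn as nn

class SiameseEmbedder(nn.Module):
    def __init__(self, in_dim=120, emb_dim=64):
        super().__init__()
        # One shared encoder applied to both members of a pair.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x_a, x_b):
        return self.encoder(x_a), self.encoder(x_b)

def same_different_loss(z_a, z_b, same, margin=1.0):
    # Pull "same" pairs together; push "different" pairs at least
    # `margin` apart (hinge on the Euclidean distance).
    d = torch.norm(z_a - z_b, dim=1)
    return (same * d.pow(2)
            + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

# Mono-task multimodal variant: concatenate audio and lip features.
audio_a, lips_a = torch.randn(32, 80), torch.randn(32, 40)
audio_b, lips_b = torch.randn(32, 80), torch.randn(32, 40)
same = torch.randint(0, 2, (32,)).float()   # lexical same-different labels

model = SiameseEmbedder(in_dim=120)
z_a, z_b = model(torch.cat([audio_a, lips_a], dim=1),
                 torch.cat([audio_b, lips_b], dim=1))
same_different_loss(z_a, z_b, same).backward()

In the multi-task variant the abstract describes, the same kind of shared encoder would instead receive different modality combinations (audio only, visual only, concatenated) across training batches.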
Common Representation Learning Using Step-based Correlation Multi-Modal CNN
Deep learning techniques have been successfully used in learning a common
representation for multi-view data, wherein the different modalities are
projected onto a common subspace. In a broader perspective, the techniques used
to investigate common representation learning fall into two categories:
canonical correlation-based approaches and autoencoder-based approaches. In
this paper, we investigate the performance of deep autoencoder based methods on
multi-view data. We propose a novel step-based correlation multi-modal CNN
(CorrMCNN) which reconstructs one view of the data given the other while
increasing the interaction between the representations at each hidden layer or
every intermediate step. Finally, we evaluate the performance of the proposed
model on two benchmark datasets - MNIST and XRMB. Through extensive
experiments, we find that the proposed model achieves better performance than
the current state-of-the-art techniques on joint common representation learning
and transfer learning tasks.
Comment: Accepted at the Asian Conference on Pattern Recognition (ACPR-2017).
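The step-based correlation idea can be sketched roughly as follows (an interpretation under assumptions, not the paper's exact CorrMCNN architecture): each view is encoded, one view is reconstructed from the other view's code, and a correlation penalty couples the two representations at every intermediate layer. The layer sizes, view dimensions, and trade-off weight lam are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def corr_penalty(h1, h2, eps=1e-8):
    # Negative mean per-unit correlation across the batch; minimizing it
    # increases the interaction between the two views at this layer.
    h1, h2 = h1 - h1.mean(0), h2 - h2.mean(0)
    num = (h1 * h2).sum(0)
    den = torch.sqrt((h1 ** 2).sum(0) * (h2 ** 2).sum(0)) + eps
    return -(num / den).mean()

class TwoViewCorrAE(nn.Module):
    def __init__(self, d1=392, d2=392, h=256, k=50):
        super().__init__()
        self.enc1a, self.enc1b = nn.Linear(d1, h), nn.Linear(h, k)
        self.enc2a, self.enc2b = nn.Linear(d2, h), nn.Linear(h, k)
        self.dec12 = nn.Linear(k, d2)   # reconstruct view 2 from view 1's code
        self.dec21 = nn.Linear(k, d1)   # and vice versa

    def forward(self, x1, x2, lam=0.1):     # lam: assumed trade-off weight
        h1, h2 = torch.relu(self.enc1a(x1)), torch.relu(self.enc2a(x2))
        z1, z2 = self.enc1b(h1), self.enc2b(h2)
        # Correlation is applied at each intermediate step, not only at the top.
        step_corr = corr_penalty(h1, h2) + corr_penalty(z1, z2)
        recon = F.mse_loss(self.dec12(z1), x2) + F.mse_loss(self.dec21(z2), x1)
        return recon + lam * step_corr

# e.g. left/right halves of MNIST digits as the two views
x1, x2 = torch.randn(64, 392), torch.randn(64, 392)
TwoViewCorrAE()(x1, x2).backward()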
Stochastic Optimization for Deep CCA via Nonlinear Orthogonal Iterations
Deep CCA is a recently proposed deep neural network extension of traditional
canonical correlation analysis (CCA), and has been successful for
multi-view representation learning in several domains. However, stochastic
optimization of the deep CCA objective is not straightforward, because it does
not decouple over training examples. Previous optimizers for deep CCA are
either batch-based algorithms or stochastic optimizers that use large
minibatches, which can have high memory consumption. In this paper, we tackle
the problem of stochastic optimization for deep CCA with small minibatches,
based on an iterative solution to the CCA objective, and show that we can
achieve performance as good as that of previous optimizers while alleviating
the memory requirement.
Comment: In the 2015 Annual Allerton Conference on Communication, Control, and Computing.
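One way to read the nonlinear orthogonal iterations idea, sketched under assumptions (this is not the authors' released implementation): replace full-batch covariance statistics with running estimates, then let each network regress toward a whitened version of the other view's projections on every small minibatch. The decay rate rho, the dimensions, and the choice of Adam are illustrative.

import torch
import torch.nn as nn

d, k, rho = 100, 10, 0.9     # input dim, CCA dim, covariance decay (assumed)
f = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, k))
g = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, k))
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
S1, S2 = torch.eye(k), torch.eye(k)   # running output-covariance estimates

def whiten(h, S):
    # Whiten centered outputs using the running covariance estimate.
    L = torch.linalg.cholesky(S + 1e-4 * torch.eye(k))
    return torch.linalg.solve_triangular(L, (h - h.mean(0)).T, upper=False).T

for _ in range(100):                                  # toy training loop
    x1, x2 = torch.randn(32, d), torch.randn(32, d)   # small minibatch
    h1, h2 = f(x1), g(x2)
    c1, c2 = (h1 - h1.mean(0)).detach(), (h2 - h2.mean(0)).detach()
    # Update the covariance estimates without needing a large minibatch.
    S1 = rho * S1 + (1 - rho) * c1.T @ c1 / len(x1)
    S2 = rho * S2 + (1 - rho) * c2.T @ c2 / len(x2)
    # Each network regresses toward the whitened output of the other view.
    t1, t2 = whiten(h2, S2).detach(), whiten(h1, S1).detach()
    loss = ((h1 - t1) ** 2).mean() + ((h2 - t2) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

Because only k-by-k covariance estimates are carried between steps, memory use in this sketch is independent of the minibatch size, which is the point of avoiding large-minibatch optimization.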