1,496 research outputs found
Scalable and Effective Deep CCA via Soft Decorrelation
Recently the widely used multi-view learning model, Canonical Correlation
Analysis (CCA) has been generalised to the non-linear setting via deep neural
networks. Existing deep CCA models typically first decorrelate the feature
dimensions of each view before the different views are maximally correlated in
a common latent space. This feature decorrelation is achieved by enforcing an
exact decorrelation constraint; these models are thus computationally expensive
due to the matrix inversion or SVD operations required for exact decorrelation
at each training iteration. Furthermore, the decorrelation step is often
separated from the gradient descent based optimisation, resulting in
sub-optimal solutions. We propose a novel deep CCA model Soft CCA to overcome
these problems. Specifically, exact decorrelation is replaced by soft
decorrelation via a mini-batch based Stochastic Decorrelation Loss (SDL) to be
optimised jointly with the other training objectives. Extensive experiments
show that the proposed soft CCA is more effective and efficient than existing
deep CCA models. In addition, our SDL loss can be applied to other deep models
beyond multi-view learning, and obtains superior performance compared to
existing decorrelation losses.Comment: To Appear at CVPR201
Deep Multi-View Learning for Visual Understanding
PhD ThesisMulti-view data is the result of an entity being perceived or represented from multiple perspectives. Plenty of applications in visual understanding contain multi-view data. For example, the face images for training a recognition system are usually captured by different devices from multiple angles. This thesis focuses on the cross-view visual recognition problems, e.g., identifying the face images of the same person across different cameras. Several representative multi-view settings, from the supervised multi-view learning to the more challenging unsupervised domain adaptive (UDA) multi-view learning, are investigated. Novel multi-view learning algorithms are proposed correspondingly. To be more specific, the proposed methods are based on the advanced deep neural network (DNN) architectures for better handling visual data. However, directly combining the multi-view learning objectives with DNN can result in different issues, e.g., on scalability, and limit the application scenarios and model performance. Corresponding novelties in DNN methods are thus required to solve them. This thesis is organised into three parts. Each chapter focuses on a multi-view learning setting with novel solutions and is detailed as follows: Chapter 3 A supervised multi-view learning setting with two different views are studied. To recognise the data samples across views, one strategy is aligning them in a common feature space via correlation maximisation. It is also known as canonical correlation analysis (CCA). Deep CCA has been proposed for better performance with the non-linear projection via deep neural networks. Existing deep CCA models typically decorrelate the deep feature dimensions of each view before their Euclidean distances are minimised in the common space. This feature decorrelation is achieved by enforcing an exact decorrelation constraint which is computationally expensive due to the matrix inversion or SVD operations. Therefore, existing deep CCA models are inefficient and have scalability issues. Furthermore, the exact decorrelation is incompatible with the gradient based deep model training and results in sub-optimal solution. To overcome these aforementioned issues, a novel deep CCA model Soft CCA is introduced in this thesis. Specifically, the exact decorrelation is replaced by soft decorrelation via a mini-batch based Stochastic Decorrelation Loss (SDL). It can be jointly optimised with the other training objectives. In addition, our SDL loss can be applied to other deep models beyond multi-view learning. Chapter 4 The supervised multi-view learning setting, whereby more than two views exist, are studied in this chapter. Recently developed deep multi-view learning algorithms either learn a latent visual representation based on a single semantic level and/or require laborious human annotation of these factors as attributes. A novel deep neural network architecture, called Multi- Level Factorisation Net (MLFN), is proposed to automatically factorise the visual appearance into latent discriminative factors at multiple semantic levels without manual annotation. The main purpose is forcing different views share the same latent factors so that they are can be aligned at all layers. Specifically, MLFN is composed of multiple stacked blocks. Each block contains multiple factor modules to model latent factors at a specific level, and factor selection modules that dynamically select the factor modules to interpret the content of each input image. The outputs of the factor selection modules also provide a compact latent factor descriptor that is complementary to the conventional deeply learned feature, and they can be fused efficiently. The effectiveness of the proposed MLFN is demonstrated by not only the large-scale cross-view recognition problems but also the general object categorisation tasks. Chapter 5 The last problem is a special unsupervised domain adaptation setting called unsupervised domain adaptive (UDA) multi-view learning. It contains a fully annotated dataset as the source domain and another unsupervised dataset with relevant tasks as the target domain. The main purpose is to improve the performance of the unlabelled dataset with the annotated data from the other dataset. More importantly, this setting further requires both the source and target domains are multi-view datasets with relevant tasks. Therefore, the assumption of the aligned label space across domains is inappropriate in the UDA multi-view learning. For example, the person re-identification (Re-ID) datasets built on different surveillance scenarios are with images of different people captured and should be given disjoint person identity labels. Existing methods for UDA multi-view learning problems are aligning different domains either in the raw image space or a feature embedding space for domain alignment. In this thesis, a different framework, multi-task learning, is adopted with the domain specific objectives for a common space learning. Specifically, such common space is proposed to enable the knowledge transfer. The conventional supervised losses can be used for the labelled source data while the unsupervised objectives for the target domain play the key roles in domain adaptation. Two novel unsupervised objectives are introduced for UDA multi-view learning and result in two models as below. The first model, termed common factorised space model (CFSM), is built on the assumptions that the semantic latent attributes are shared between the source and target domains since they are relevant multi-view learning tasks. Different from the existing methods that based on domain alignment, CFSM emphasizes on transferring the information across domains via discovering discriminative latent factors in the proposed common space. However, the multi-view data from target domain is without labels. Therefore, an unsupervised factorisation loss is derived and applied on the common space for latent factors discovery across domains. The second model still learns a shared embedding space with multi-view data from both domains but with a different assumption. It attempts to discover the latent correspondence of multi-view data in the unsupervised target data. The target data’s contribution comes from a clustering process. Each cluster thus reveals the underlying cross-view correspondences across multiple views in target domain. To this end, a novel Stochastic Inference for Deep Clustering (SIDC) method is proposed. It reduces self-reinforcing errors that lead to premature convergence to a sub-optimal solution by changing the conventional deterministic cluster assignment to a stochastic one
High frame-rate cardiac ultrasound imaging with deep learning
Cardiac ultrasound imaging requires a high frame rate in order to capture
rapid motion. This can be achieved by multi-line acquisition (MLA), where
several narrow-focused received lines are obtained from each wide-focused
transmitted line. This shortens the acquisition time at the expense of
introducing block artifacts. In this paper, we propose a data-driven
learning-based approach to improve the MLA image quality. We train an
end-to-end convolutional neural network on pairs of real ultrasound cardiac
data, acquired through MLA and the corresponding single-line acquisition (SLA).
The network achieves a significant improvement in image quality for both
and line MLA resulting in a decorrelation measure similar to that of SLA
while having the frame rate of MLA.Comment: To appear in the Proceedings of MICCAI, 201
From neural PCA to deep unsupervised learning
A network supporting deep unsupervised learning is presented. The network is
an autoencoder with lateral shortcut connections from the encoder to decoder at
each level of the hierarchy. The lateral shortcut connections allow the higher
levels of the hierarchy to focus on abstract invariant features. While standard
autoencoders are analogous to latent variable models with a single layer of
stochastic variables, the proposed network is analogous to hierarchical latent
variables models. Learning combines denoising autoencoder and denoising sources
separation frameworks. Each layer of the network contributes to the cost
function a term which measures the distance of the representations produced by
the encoder and the decoder. Since training signals originate from all levels
of the network, all layers can learn efficiently even in deep networks. The
speedup offered by cost terms from higher levels of the hierarchy and the
ability to learn invariant features are demonstrated in experiments.Comment: A revised version of an article that has been accepted for
publication in Advances in Independent Component Analysis and Learning
Machines (2015), edited by Ella Bingham, Samuel Kaski, Jorma Laaksonen and
Jouko Lampine
Linking Image and Text with 2-Way Nets
Linking two data sources is a basic building block in numerous computer
vision problems. Canonical Correlation Analysis (CCA) achieves this by
utilizing a linear optimizer in order to maximize the correlation between the
two views. Recent work makes use of non-linear models, including deep learning
techniques, that optimize the CCA loss in some feature space. In this paper, we
introduce a novel, bi-directional neural network architecture for the task of
matching vectors from two data sources. Our approach employs two tied neural
network channels that project the two views into a common, maximally correlated
space using the Euclidean loss. We show a direct link between the
correlation-based loss and Euclidean loss, enabling the use of Euclidean loss
for correlation maximization. To overcome common Euclidean regression
optimization problems, we modify well-known techniques to our problem,
including batch normalization and dropout. We show state of the art results on
a number of computer vision matching tasks including MNIST image matching and
sentence-image matching on the Flickr8k, Flickr30k and COCO datasets.Comment: 14 pages, 2 figures, 6 table
Unveiling the Potential of Probabilistic Embeddings in Self-Supervised Learning
In recent years, self-supervised learning has played a pivotal role in
advancing machine learning by allowing models to acquire meaningful
representations from unlabeled data. An intriguing research avenue involves
developing self-supervised models within an information-theoretic framework,
but many studies often deviate from the stochasticity assumptions made when
deriving their objectives. To gain deeper insights into this issue, we propose
to explicitly model the representation with stochastic embeddings and assess
their effects on performance, information compression and potential for
out-of-distribution detection. From an information-theoretic perspective, we
seek to investigate the impact of probabilistic modeling on the information
bottleneck, shedding light on a trade-off between compression and preservation
of information in both representation and loss space. Emphasizing the
importance of distinguishing between these two spaces, we demonstrate how
constraining one can affect the other, potentially leading to performance
degradation. Moreover, our findings suggest that introducing an additional
bottleneck in the loss space can significantly enhance the ability to detect
out-of-distribution examples, only leveraging either representation features or
the variance of their underlying distribution.Comment: Under review by AISTATS 202
- …