Deep Multi-View Learning for Visual Understanding
PhD Thesis
Multi-view data arise when an entity is perceived or represented from multiple perspectives, and many visual understanding applications involve such data. For example, the face images used to train a recognition system are usually captured by different devices from multiple angles. This thesis focuses on cross-view visual recognition problems, e.g., identifying face images of the same person across different cameras. Several representative multi-view settings are investigated, from supervised multi-view learning to the more challenging unsupervised domain adaptive (UDA) multi-view learning, and novel multi-view learning algorithms are proposed for each. The proposed methods build on deep neural network (DNN) architectures, which are better suited to visual data. However, directly combining multi-view learning objectives with DNNs raises issues, e.g., of scalability, that limit the application scenarios and model performance, so corresponding novelties in the DNN methods are required. This thesis is organised into three parts. Each chapter focuses on one multi-view learning setting with novel solutions, as detailed below.
Chapter 3: A supervised multi-view learning setting with two different views is studied. To recognise data samples across views, one strategy is to align them in a common feature space via correlation maximisation, also known as canonical correlation analysis (CCA). Deep CCA has been proposed to improve performance through non-linear projections learned by deep neural networks. Existing deep CCA models typically decorrelate the deep feature dimensions of each view before minimising their Euclidean distances in the common space. This feature decorrelation is achieved by enforcing an exact decorrelation constraint, which is computationally expensive due to matrix inversion or SVD operations.
Therefore, existing deep CCA models are inefficient and scale poorly. Furthermore, exact decorrelation is incompatible with gradient-based deep model training and results in sub-optimal solutions. To overcome these issues, a novel deep CCA model, Soft CCA, is introduced in this thesis. Specifically, exact decorrelation is replaced by soft decorrelation via a mini-batch based Stochastic Decorrelation Loss (SDL), which can be jointly optimised with the other training objectives. In addition, SDL can be applied to other deep models beyond multi-view learning.
Chapter 4: The supervised multi-view learning setting in which more than two views exist is studied in this chapter. Recently developed deep multi-view learning algorithms learn a latent visual representation at a single semantic level and/or require laborious human annotation of these factors as attributes. A novel deep neural network architecture, called Multi-Level Factorisation Net (MLFN), is proposed to automatically factorise visual appearance into latent discriminative factors at multiple semantic levels without manual annotation. The main purpose is to force different views to share the same latent factors so that they can be aligned at all layers. Specifically, MLFN is composed of multiple stacked blocks. Each block contains multiple factor modules that model latent factors at a specific level, and factor selection modules that dynamically select the factor modules used to interpret the content of each input image. The outputs of the factor selection modules also provide a compact latent factor descriptor that is complementary to the conventional deeply learned feature, and the two can be fused efficiently. The effectiveness of the proposed MLFN is demonstrated not only on large-scale cross-view recognition problems but also on general object categorisation tasks.
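To illustrate the soft decorrelation idea behind Soft CCA (Chapter 3), here is a minimal sketch in plain Python. It assumes the penalty is the L1 norm of the off-diagonal entries of the mini-batch covariance matrix; the abstract does not give the exact formulation, and any smoothing of the covariance estimate across mini-batches is omitted. The function name and the toy batches are hypothetical.

```python
def soft_decorrelation_loss(batch):
    """L1 penalty on off-diagonal entries of the mini-batch covariance.

    batch: list of feature vectors (one per sample), all of equal length.
    Assumption: this form of the penalty is illustrative only.
    """
    n = len(batch)
    d = len(batch[0])
    # Per-dimension mean over the mini-batch.
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in batch]
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue  # variances on the diagonal are not penalised
            cov_ij = sum(row[i] * row[j] for row in centred) / n
            loss += abs(cov_ij)
    return loss


# Toy mini-batches: the second feature is a multiple of the first ...
correlated = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
# ... versus two features with zero covariance.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

print(soft_decorrelation_loss(correlated))
print(soft_decorrelation_loss(decorrelated))
```

Because the penalty is a plain sum of absolute values, it involves no matrix inversion or SVD and could simply be added to other training losses, which is the property the abstract highlights.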
Chapter 5: The last problem is a special unsupervised domain adaptation setting called unsupervised domain adaptive (UDA) multi-view learning. It involves a fully annotated dataset as the source domain and an unlabelled dataset with relevant tasks as the target domain. The main purpose is to improve performance on the unlabelled dataset using the annotated data from the other dataset. More importantly, this setting further requires that both the source and target domains be multi-view datasets with relevant tasks. Therefore, the assumption of an aligned label space across domains is inappropriate in UDA multi-view learning. For example, person re-identification (Re-ID) datasets built on different surveillance scenarios contain images of different people and should be given disjoint person identity labels. Existing methods for UDA multi-view learning align the domains either in the raw image space or in a feature embedding space. In this thesis, a different framework, multi-task learning, is adopted, with domain-specific objectives for learning a common space. Specifically, this common space is proposed to enable knowledge transfer. Conventional supervised losses can be used for the labelled source data, while the unsupervised objectives for the target domain play the key role in domain adaptation. Two novel unsupervised objectives are introduced for UDA multi-view learning, resulting in the two models below.
The first model, termed the common factorised space model (CFSM), is built on the assumption that semantic latent attributes are shared between the source and target domains, since they are relevant multi-view learning tasks. Unlike existing methods that are based on domain alignment, CFSM emphasises transferring information across domains by discovering discriminative latent factors in the proposed common space.
However, the multi-view data from the target domain are unlabelled. Therefore, an unsupervised factorisation loss is derived and applied to the common space to discover latent factors across domains.
The second model still learns a shared embedding space with multi-view data from both domains, but under a different assumption. It attempts to discover the latent correspondence of multi-view data in the unlabelled target data. The target data’s contribution comes from a clustering process: each cluster reveals the underlying correspondences across multiple views in the target domain. To this end, a novel Stochastic Inference for Deep Clustering (SIDC) method is proposed. It reduces the self-reinforcing errors that lead to premature convergence to a sub-optimal solution by changing the conventional deterministic cluster assignment to a stochastic one.
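The contrast between deterministic and stochastic cluster assignment can be sketched as follows. This is not the SIDC procedure itself, which the abstract does not specify; it assumes, purely for illustration, that assignment probabilities are a softmax over negative squared distances to the centroids. All names, the temperature parameter and the toy data are hypothetical.

```python
import math
import random


def assignment_probabilities(point, centroids, temperature=1.0):
    """Softmax over negative squared distances to each centroid (an assumption)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    d_min = min(dists)  # subtract the smallest distance for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def deterministic_assign(point, centroids):
    """Conventional hard assignment: always pick the most probable cluster."""
    probs = assignment_probabilities(point, centroids)
    return max(range(len(probs)), key=probs.__getitem__)


def stochastic_assign(point, centroids, rng, temperature=1.0):
    """Sample a cluster index in proportion to its assignment probability."""
    probs = assignment_probabilities(point, centroids, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


centroids = [[0.0, 0.0], [1.0, 0.0]]
point = [0.4, 0.0]  # an ambiguous sample lying between the two centroids
rng = random.Random(0)
samples = [stochastic_assign(point, centroids, rng) for _ in range(1000)]

print(deterministic_assign(point, centroids))
print(sorted(set(samples)))
```

The hard rule always sends the ambiguous point to the same cluster, so an early mistake keeps being reinforced, whereas sampling occasionally places it in the other cluster and leaves room for the assignment to be revised.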
Inductive Bias and Modular Design for Sample-Efficient Neural Language Learning
Most of the world's languages suffer from the paucity of annotated data. This curbs the effectiveness of supervised learning, the most widespread approach to modelling language. Instead, an alternative paradigm could take inspiration from the propensity of children to acquire language from limited stimuli, in order to enable machines to learn any new language from a few examples. The abstract mechanisms underpinning this ability include 1) a set of in-born inductive biases and 2) the deep entrenchment of language in other perceptual and cognitive faculties, combined with the ability to transfer and recombine knowledge across these domains. The main contribution of my thesis is giving concrete form to both these intuitions.
Firstly, I argue that endowing a neural network with the correct inductive biases is equivalent to constructing a prior distribution over its weights and its architecture (including connectivity patterns and non-linear activations). This prior is inferred by "reverse-engineering" a representative set of observed languages and harnessing typological features documented by linguists. Thus, I provide a unified framework for cross-lingual transfer and architecture search by recasting them as hierarchical Bayesian neural models.
Secondly, the skills relevant to different language varieties and different tasks in natural language processing are deeply intertwined. Hence, the neural weights modelling the data for each of their combinations can be imagined as lying in a structured space. I introduce a Bayesian generative model of this space, which is factorised into latent variables representing each language and each task. By virtue of this modular design, predictions can generalise to unseen combinations by extrapolating from the data of observed combinations.
The proposed models are empirically validated on a spectrum of language-related tasks (character-level language modelling, part-of-speech tagging, named entity recognition, and common-sense reasoning) and a typologically diverse sample of about a hundred languages. Compared to a series of competitive baselines, they achieve better performance in new languages in zero-shot and few-shot learning settings. In general, they hold promise to extend state-of-the-art language technology to under-resourced languages by means of sample efficiency and robustness to cross-lingual variation.
ERC (Consolidator Grant 648909) Lexical
Google Research Faculty Award 201
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation
Zero-Shot Learning (ZSL) promises to scale visual recognition by bypassing
the conventional model training requirement of annotated examples for every
category. This is achieved by establishing a mapping connecting low-level
features and a semantic description of the label space, referred to as
visual-semantic mapping, on auxiliary data. Reusing the learned mapping to
project target videos into an embedding space thus allows novel classes to be
recognised by nearest neighbour inference. However, existing ZSL methods suffer
from auxiliary-target domain shift intrinsically induced by assuming the same
mapping for the disjoint auxiliary and target classes. This compromises the
generalisation accuracy of ZSL recognition on the target data. In this work, we
improve the ability of ZSL to generalise across this domain shift in both
model- and data-centric ways by formulating a visual-semantic mapping with
better generalisation properties and a dynamic data re-weighting method to
prioritise auxiliary data that are relevant to the target classes.
Specifically: (1) We introduce a multi-task visual-semantic mapping to improve
generalisation by constraining the semantic mapping parameters to lie on a
low-dimensional manifold, (2) We explore prioritised data augmentation by
expanding the pool of auxiliary data with additional instances weighted by
relevance to the target domain. The proposed new model is applied to the
challenging zero-shot action recognition problem to demonstrate its advantages
over existing ZSL models.
Comment: Published in ECCV 201
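The inference step described in this abstract, projecting a target video into the semantic embedding space and recognising it by nearest neighbour, can be sketched as below. The sketch assumes a linear mapping that has already been learned on auxiliary classes and uses cosine similarity as the nearness measure; the mapping values, class names and embeddings are hypothetical, and the paper's multi-task mapping and data re-weighting are not reproduced here.

```python
import math


def project(features, mapping):
    """Apply a linear visual-semantic mapping (each row gives one output dim)."""
    return [sum(w * x for w, x in zip(row, features)) for row in mapping]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_classify(features, mapping, class_embeddings):
    """Return the unseen class whose embedding is nearest to the projection."""
    projected = project(features, mapping)
    return max(class_embeddings,
               key=lambda name: cosine(projected, class_embeddings[name]))


# Hypothetical mapping, assumed to have been learned on auxiliary classes.
mapping = [[1.0, 0.0, 0.5],
           [0.0, 1.0, 0.5]]
# Hypothetical semantic embeddings (e.g. word vectors) of two unseen classes.
class_embeddings = {"surfing": [1.0, 0.1], "archery": [0.1, 1.0]}

video_features = [0.9, 0.2, 0.1]
prediction = zero_shot_classify(video_features, mapping, class_embeddings)
print(prediction)
```

No example of either target class is used to fit the mapping, which is why a shift between auxiliary and target classes directly affects the quality of the projection.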
Knowledge sharing: From atomic to parametrised context and shallow to deep models
PhD
Key to achieving more effective machine intelligence is the capability to generalise knowledge
across different contexts. In this thesis, we develop a new and very general perspective
on knowledge sharing that unifies and generalises many existing methodologies,
while being practically effective, simple to implement, and opening up new problem settings.
Knowledge sharing across tasks and domains has conventionally been studied disparately.
We first introduce the concept of a semantic descriptor and a
flexible neural network approach to knowledge sharing that together unify multi-task/multi-domain
learning, and encompass various classic and recent multi-domain learning (MDL) and
multi-task learning (MTL) algorithms as special cases.
We next generalise this framework from single-output to multi-output problems and
from shallow to deep models. To achieve this, we establish the equivalence between
classic tensor decomposition methods and specific neural network architectures. This
makes it possible to implement our framework within modern deep learning stacks. We
present both explicit low-rank, and trace norm regularisation solutions.
From a practical perspective, we also explore a new problem setting of zero-shot
domain adaptation (ZSDA) where a model can be calibrated solely based on some
abstract information of a new domain, e.g., some metadata like the capture device of
photos, without collecting or labelling the data.
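One way to picture the semantic descriptor idea is a bilinear model in which a shared parameter matrix W turns a descriptor z into domain-specific weights w = W z, so that a prediction is y = xᵀ W z. The sketch below assumes this linear, single-output form for illustration only; the function names and all numbers are hypothetical, and the deep and tensor-factorised variants discussed in the thesis are not shown.

```python
def generate_weights(shared, descriptor):
    """Synthesise domain-specific weights w = W z from the shared matrix W."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, descriptor))
            for row in shared]


def predict(x, shared, descriptor):
    """Bilinear prediction y = x^T W z for input x in the domain described by z."""
    weights = generate_weights(shared, descriptor)
    return sum(x_i * w_i for x_i, w_i in zip(x, weights))


# Hypothetical shared parameters W (2 input features x 2 descriptor entries).
shared = [[1.0, 0.0],
          [0.0, 2.0]]

# One-hot ("atomic") descriptors select one column of W per training domain.
domain_a = [1.0, 0.0]
domain_b = [0.0, 1.0]
# A parametrised descriptor for an unseen domain, e.g. derived from metadata.
unseen = [0.5, 0.5]

x = [3.0, 4.0]
print(predict(x, shared, domain_a))
print(predict(x, shared, domain_b))
print(predict(x, shared, unseen))
```

Under this assumption, a model for the unseen domain is obtained from its descriptor alone, without any data from that domain, which is the sense in which zero-shot domain adaptation is possible.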
Deeper, Broader and Artier Domain Generalization
The problem of domain generalization is to learn from multiple training
domains, and extract a domain-agnostic model that can then be applied to an
unseen domain. Domain generalization (DG) has a clear motivation in contexts
where there are target domains with distinct characteristics, yet sparse data
for training. For example, recognition in sketch images, which are distinctly
more abstract and rarer than photos. Nevertheless, DG methods have primarily
been evaluated on photo-only benchmarks focusing on alleviating the dataset
bias where both problems of domain distinctiveness and data sparsity can be
minimal. We argue that these benchmarks are overly straightforward, and show
that simple deep learning baselines perform surprisingly well on them. In this
paper, we make two main contributions: Firstly, we build upon the favorable
domain shift-robust properties of deep learning methods, and develop a low-rank
parameterized CNN model for end-to-end DG learning. Secondly, we develop a DG
benchmark dataset covering photo, sketch, cartoon and painting domains. This is
both more practically relevant, and harder (bigger domain shift) than existing
benchmarks. The results show that our method outperforms existing DG
alternatives, and our dataset provides a more significant DG challenge to drive
future research.
Comment: 9 pages, 4 figures, ICCV 201