Scalable Methodologies and Analyses for Modality Bias and Feature Exploitation in Language-Vision Multimodal Deep Learning
Multimodal machine learning benchmarks have grown exponentially in both capability and popularity over the last decade. Language-vision question-answering tasks such as Visual Question Answering (VQA) and Video Question Answering (video-QA) have, thanks to their high difficulty, become a particularly popular means through which to develop and test new modelling designs and methodologies for multimodal deep learning. The challenging nature of VQA and video-QA tasks leaves plenty of room for innovation at every component of the deep learning pipeline: from dataset to modelling methodology. Such circumstances are ideal for innovating in the space of language-vision multimodality. Furthermore, the wider field is currently undergoing a period of incredible growth and increasing interest. I therefore aim to contribute to multiple key components of the VQA and video-QA pipeline, but specifically in a manner such that my contributions remain relevant, 'scaling' with the revolutionary new benchmark models and datasets of the near future instead of being rendered obsolete by them. The work in this thesis: highlights and explores the disruptive and problematic presence of language bias in the popular TVQA video-QA dataset, and proposes a dataset-invariant method to identify subsets that respond to different modalities; thoroughly explores the suitability of bilinear pooling as a language-vision fusion technique in video-QA, offering experimental and theoretical insight, and highlighting the parallels between multimodal processing and neurological theories; explores the nascent visual equivalent of language modelling ('visual modelling') in order to boost the power of visual features; and proposes a dataset-invariant, neurolinguistically-inspired labelling scheme for use in multimodal question-answering. I explore the positive and negative results that my experiments across this thesis yield.
I conclude by discussing the limitations of my contributions and by proposing future directions of study in the areas to which I contribute.
Deep learning for accelerated magnetic resonance imaging
Medical imaging has aided the biggest advances in the medical domain in the last century. Whilst X-ray, CT, PET and ultrasound are imaging modalities that can be useful in particular scenarios, each has disadvantages in cost, image quality, ease-of-use or ionising radiation. MRI is a slow imaging protocol, which contributes to its high running cost. However, MRI is a very versatile imaging protocol, allowing images of varying contrast to be easily generated whilst not requiring the use of ionising radiation. If MRI can be made more efficient and smarter, the effective cost of running MRI may become more affordable and accessible. The focus of this thesis is decreasing the acquisition time involved in MRI whilst maintaining the quality of the generated images, and thus of diagnosis. In particular, we focus on data-driven deep learning approaches that aid the image reconstruction process and streamline the diagnostic process. We focus on three particular aspects of MR acquisition. Firstly, we investigate the use of motion estimation in the cine reconstruction process. Motion allows us to combine an abundance of imaging data in a learnt reconstruction model, allowing acquisitions to be sped up by up to 50 times in extreme scenarios. Secondly, we investigate the possibility of using under-acquired MR data to generate smart diagnoses in the form of automated text reports. In particular, we investigate the possibility of skipping the image reconstruction phase altogether at inference time and instead directly generating radiological text reports for diffusion-weighted brain images, in an effort to streamline the diagnostic process. Finally, we investigate the use of probabilistic modelling for MRI reconstruction without the use of fully-acquired data.
In particular, we note that fully-acquired reference images in MRI can be difficult to obtain and, even then, may contain undesired artefacts that degrade the dataset and thus the training process. We investigate the possibility of performing reconstruction without fully-acquired references, and furthermore discuss the possibility of generating higher-quality outputs than the fully-acquired references themselves.
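The acceleration referred to above can be pictured as retrospective undersampling of k-space followed by a zero-filled reconstruction, which is the aliased baseline that a learnt reconstruction model then improves upon. A minimal NumPy sketch, where the random column mask, acceleration factor, and synthetic image are illustrative assumptions rather than the thesis's actual setup:

```python
import numpy as np

def undersample_kspace(image, acceleration=4, seed=0):
    """Retrospectively undersample an image's k-space, keeping a random
    subset of phase-encode lines (columns) plus the centre line, then
    return the zero-filled reconstruction and the sampling mask."""
    rng = np.random.default_rng(seed)
    kspace = np.fft.fftshift(np.fft.fft2(image))
    mask = rng.random(image.shape[1]) < 1.0 / acceleration
    mask[image.shape[1] // 2] = True  # keep the low-frequency centre line
    kspace_us = kspace * mask[None, :]
    # Zero-filled reconstruction: the aliased baseline a learnt model refines.
    recon = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace_us)))
    return recon, mask

image = np.outer(np.hanning(64), np.hanning(64))  # smooth synthetic "anatomy"
recon, mask = undersample_kspace(image, acceleration=4)
print(recon.shape)  # (64, 64); roughly 1/acceleration of the lines were kept
```

A learnt model would take `recon` (or `kspace_us` and `mask`) as input and be trained to recover the fully-acquired image.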
Recent Advances of Continual Learning in Computer Vision: An Overview
In contrast to batch learning where all training data is available at once,
continual learning represents a family of methods that accumulate knowledge and
learn continuously with data available in sequential order. Similar to the
human learning process with the ability of learning, fusing, and accumulating
new knowledge coming at different time steps, continual learning is considered
to have high practical significance. Hence, continual learning has been studied
in various artificial intelligence tasks. In this paper, we present a
comprehensive review of the recent progress of continual learning in computer
vision. In particular, the works are grouped by their representative
techniques, including regularization, knowledge distillation, memory,
generative replay, parameter isolation, and a combination of the above
techniques. For each category of these techniques, both its characteristics and
applications in computer vision are presented. At the end of this overview,
several subareas, where continuous knowledge accumulation is potentially
helpful while continual learning has not been well studied, are discussed.
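As a minimal illustration of the regularization family of techniques named above, one can penalise drift from parameters that mattered for earlier tasks, in the style of elastic weight consolidation. This is a hypothetical sketch, not any surveyed method's exact formulation; `importance` stands in for a per-parameter importance estimate such as a diagonal Fisher approximation:

```python
import numpy as np

def regularised_loss(task_loss, params, old_params, importance, lam=1.0):
    """Add an EWC-style quadratic penalty discouraging parameters from
    drifting away from values that mattered for previous tasks."""
    penalty = sum(
        float(np.sum(w * (p - p_old) ** 2))
        for p, p_old, w in zip(params, old_params, importance)
    )
    return task_loss + lam / 2.0 * penalty

old = [np.array([1.0, 1.0])]      # parameters after the previous task
new = [np.array([1.0, 2.0])]      # parameters during the current task
fisher = [np.array([2.0, 2.0])]   # hypothetical per-parameter importance
print(regularised_loss(0.5, new, old, fisher))  # 1.5
```

The other families (replay, distillation, parameter isolation) would replace or complement this penalty term.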
DeepVATS: Deep Visual Analytics for time series
The field of Deep Visual Analytics (DVA) has recently arisen from the idea of developing Visual Interactive Systems supported by deep learning, in order to provide them with large-scale data processing capabilities and to unify their implementation across different data and domains. In this paper we present DeepVATS, an open-source tool that brings the field of DVA to time series data. DeepVATS trains, in a self-supervised way, a masked time series autoencoder that reconstructs patches of a time series, and projects the knowledge contained in the embeddings of that model into an interactive plot, from which time series patterns and anomalies emerge and can be easily spotted. The tool includes a back-end for the data processing pipeline and model training, as well as a front-end with an interactive user interface. We report results that validate the utility of DeepVATS, running experiments on both synthetic and real datasets. The code is publicly available at https://github.com/vrodriguezf/deepvats
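The patch-masking step that drives the self-supervised training described above can be sketched as follows; the patch length, mask ratio, and zeroing strategy here are illustrative assumptions rather than DeepVATS's actual configuration:

```python
import numpy as np

def mask_patches(series, patch_len=8, mask_ratio=0.4, seed=0):
    """Split a 1-D series into patches and zero out a random subset,
    yielding (masked input, boolean patch mask) for reconstruction training."""
    rng = np.random.default_rng(seed)
    n = len(series) // patch_len
    patches = series[: n * patch_len].reshape(n, patch_len)
    mask = rng.random(n) < mask_ratio
    masked = patches.copy()
    masked[mask] = 0.0  # the autoencoder must reconstruct these patches
    return masked, mask

series = np.sin(np.linspace(0, 8 * np.pi, 128))
masked, mask = mask_patches(series)
print(masked.shape)  # (16, 8)
```

The autoencoder is trained to recover the original patches from `masked`; its embeddings are then projected into the interactive plot where patterns and anomalies become visible.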
Visual Learning in Limited-Label Regime.
PhD Thesis Abstract
Deep learning algorithms and architectures have greatly advanced the state-of-the-art in a
wide variety of computer vision tasks, such as object recognition and image retrieval. To achieve
human- or even super-human-level performance in most visual recognition tasks, large collections
of labelled data are generally required to formulate meaningful supervision signals for
model training. The standard supervised learning paradigm, however, is undesirable from several perspectives.
First, constructing large-scale labelled datasets not only requires exhaustive manual
annotation efforts, but may also be legally prohibited. Second, deep neural networks trained with
full label supervision upon a limited amount of labelled data are weak at generalising to new
unseen data captured from a different data distribution. This thesis targets the critical
problem of insufficient label annotations in deep learning. More specifically, we investigate
four different deep learning paradigms in the limited-label regime: close-set semi-supervised
learning, open-set semi-supervised learning, open-set cross-domain learning, and
unsupervised learning. The former two paradigms are explored in visual classification, which
aims to recognise different categories in the images; while the latter two paradigms are studied in
visual search – particularly in person re-identification – which aims at discriminating different
but similar persons in a finer-grained manner and can be extended to the discrimination of other
objects of high visual similarities. We detail our studies of these paradigms as follows.
Chapter 3: Close-Set Semi-Supervised Learning (Figure 1 (I)) is a fundamental semi-supervised
learning paradigm that aims to learn from a small set of labelled data and a large set of unlabelled
data, where the two sets are assumed to lie in the same label space. To address this problem, existing
semi-supervised deep learning methods often rely on the up-to-date “network-in-training”
to formulate the semi-supervised learning objective, which ignores both the discriminative feature
representation and the model inference uncertainty revealed by the network in the preceding
learning iterations, referred to as the memory of model learning. In this work, we proposed to
augment the deep neural network with a lightweight memory mechanism [Chen et al., 2018b],
which captures the underlying manifold structure of the labelled data at the per-class level, and
further imposes auxiliary unsupervised constraints to fit the unlabelled data towards the underlying
manifolds. This work established a simple yet efficient close-set semi-supervised deep
learning scheme to boost model generalisation in visual classification by learning from sparsely
labelled data and abundant unlabelled data.
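The per-class memory idea can be sketched as follows. This is a hedged illustration, not the exact mechanism of [Chen et al., 2018b]: a prototype per class is maintained by exponential moving average, and unlabelled features are softly assigned to prototypes to form an auxiliary target:

```python
import numpy as np

class ClassMemory:
    """Per-class feature prototypes updated by exponential moving average.
    Unlabelled features are softly assigned to prototypes, giving an
    auxiliary target that pulls them toward the class manifolds."""

    def __init__(self, num_classes, dim, momentum=0.9):
        self.protos = np.zeros((num_classes, dim))
        self.m = momentum

    def update(self, feats, labels):
        # Refresh each class prototype with the mean of its labelled features.
        for c in np.unique(labels):
            mean = feats[labels == c].mean(axis=0)
            self.protos[c] = self.m * self.protos[c] + (1 - self.m) * mean

    def assign(self, feats):
        # Soft assignment by negative squared distance (an illustrative choice).
        d = ((feats[:, None, :] - self.protos[None, :, :]) ** 2).sum(-1)
        e = np.exp(-d + d.min(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

mem = ClassMemory(num_classes=2, dim=2, momentum=0.0)  # momentum 0 for a deterministic demo
mem.update(np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0]]), np.array([0, 0, 1]))
probs = mem.assign(np.array([[0.5, 0.5], [9.0, 9.0]]))
print(probs.argmax(axis=1))  # [0 1]
```

In training, the soft assignments over unlabelled features would serve as the unsupervised constraint fitting them to the class manifolds.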
Chapter 4: Open-Set Semi-Supervised Learning (Figure 1 (II)) further explores the potential
of learning from abundant noisy unlabelled data. While existing SSL methods artificially assume
that small labelled data and large unlabelled data are drawn from the same class distribution, we
consider a more realistic and uncurated open-set semi-supervised learning paradigm. Since
visual data is always growing in many visual recognition tasks, it is implausible to
pre-define a fixed label space for the unlabelled data.

Figure 1: An overview of the main studies in this thesis, covering four different deep learning
paradigms in the limited-label regime: (I) close-set semi-supervised learning (Chapter 3),
(II) open-set semi-supervised learning (Chapter 4), (III) open-set cross-domain learning
(Chapter 5), and (IV) unsupervised learning (Chapter 6). Each chapter studies a specific deep
learning paradigm that propagates, selectively propagates, transfers, or discovers label
information for model optimisation, so as to minimise the manual effort of label annotation.
While the former two paradigms focus on semi-supervised learning for visual classification,
i.e. recognising different visual categories, the latter two paradigms focus on semi-supervised
and unsupervised learning for visual search, i.e. discriminating different instances such as persons.

To investigate this new challenging learning paradigm, we established the first systematic work
to tackle the open-set semi-supervised learning problem in visual classification with a novel
approach: uncertainty-aware self-distillation
[Chen et al., 2020b], which selectively propagates the soft label assignments on the
unlabelled visual data for model optimisation. Built upon an accumulative ensembling strategy,
our approach can jointly capture the model uncertainty to discard out-of-distribution samples,
and propagate less overconfident label assignments on the unlabelled data to avoid catastrophic
error propagation. As a pioneering exploration of this learning paradigm, this work opens up
new avenues for research in more realistic semi-supervised learning scenarios.
Chapter 5: Open-Set Cross-Domain Learning (Figure 1 (III)) is a challenging semi-supervised
learning paradigm of great practical value. When training a visual recognition model in an operating
visual environment (i.e. source domain, such as the laboratory, simulation, or known scene),
and then deploying it to unknown real-world scenes (i.e. target domain), it is likely that the
model would fail to generalise well in the unseen visual target domain, especially when the target
domain data comes from a disjoint label space with heterogeneous domain drift. Unlike prior
works in domain adaptation that mostly consider a shared label space across two domains, we
studied the more demanding open-set domain adaptation problem, where both label spaces and
domains are disjoint across the labelled and unlabelled datasets. To learn from these heterogeneous
datasets, we designed a novel domain context rendering scheme for open-set cross-domain
learning in visual search [Chen et al., 2019a] – particularly for person re-identification, i.e. a realistic
testbed to evaluate the representational power of fine-grained discrimination among very
similar instances. Our key idea is to transfer the source identity labels into diverse target domain
contexts. Our approach enables the generation of abundant synthetic training data
that selectively blends label information from the source domain with context information from the target
domain. By training upon such synthetic data, our model can learn a more identity-discriminative
and context-invariant representation for effective visual search in the target domain. This work
sets a new state-of-the-art in cross-domain person re-identification and provides a novel and
generic solution for open-set domain adaptation.
Chapter 6: Unsupervised Learning (Figure 1 (IV)) considers the learning scenario with no
labelled data. In this work, we explore unsupervised learning in visual search, particularly for
person re-identification, a realistic testbed to study unsupervised learning, where person identity
labels are generally very difficult to acquire over a wide surveillance space [Chen et al., 2018a].
In contrast to existing methods in person re-identification that require exhaustive manual effort
to label cross-view pairwise data, we aim to learn visual representations without using any
manual labels. Our generic rationale is to formulate auxiliary supervision signals that learn to
uncover the underlying data distribution, consequently grouping the visual data in a meaningful
and structural way. To learn from the unlabelled data in a fully unsupervised manner, we proposed
a novel deep association learning scheme to uncover the underlying data-to-data association.
Specifically, two unsupervised constraints – temporal consistency and cycle consistency –
are formulated upon neighbourhood consistency to progressively associate visual features within
and across video sequences of tracked persons. This work sets a new state-of-the-art in video-based
unsupervised person re-identification and advances the automatic exploitation of video
data in real-world surveillance.
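The cycle-consistency constraint can be illustrated at the feature level: frame i in tracklet A is associated with its nearest neighbour j in tracklet B only if i is also j's nearest neighbour back in A. The mutual-nearest-neighbour rule below is a deliberate simplification of the scheme in [Chen et al., 2018a]:

```python
import numpy as np

def cycle_consistent_pairs(feats_a, feats_b):
    """Associate frame i in tracklet A with its nearest neighbour j in
    tracklet B only if i is also j's nearest neighbour back in A."""
    d = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(-1)
    a_to_b = d.argmin(axis=1)  # nearest B-frame for each A-frame
    b_to_a = d.argmin(axis=0)  # nearest A-frame for each B-frame
    return [(i, int(j)) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

a = np.array([[0.0, 0.0], [10.0, 10.0]])
b = np.array([[0.1, 0.0], [9.9, 10.0]])
print(cycle_consistent_pairs(a, b))  # [(0, 0), (1, 1)]
```

Only associations that survive the round trip become training signal, which is what progressively groups features within and across the tracked sequences.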
In summary, the goal of all these studies is to build efficient and scalable visual learning
models in the limited-label regime: models that learn powerful and reliable representations
from complex unlabelled visual data and consequently facilitate better visual recognition
and visual search.
- …