Improving Representation Learning for Deep Clustering and Few-shot Learning
The amounts of data in the world have increased dramatically in recent years, and it is quickly becoming infeasible for humans to label all these data. It is therefore crucial that modern machine learning systems can operate with few or no labels. The introduction of deep learning and deep neural networks has led to impressive advancements in several areas of machine learning. These advancements are largely due to the unprecedented ability of deep neural networks to learn powerful representations from a wide range of complex input signals. This ability is especially important when labeled data is limited, as the absence of a strong supervisory signal forces models to rely more on intrinsic properties of the data and its representations.
This thesis focuses on two key concepts in deep learning with few or no labels. First, we aim to improve representation quality in deep clustering - both for single-view and multi-view data. Current models for deep clustering face challenges related to properly representing semantic similarities, which is crucial for the models to discover meaningful clusterings. This is especially challenging with multi-view data, since the information required for successful clustering might be scattered across many views. Second, we focus on few-shot learning, and how geometrical properties of representations influence few-shot classification performance. We find that a large number of recent methods for few-shot learning embed representations on the hypersphere. Hence, we seek to understand what makes the hypersphere a particularly suitable embedding space for few-shot learning.
Our work on single-view deep clustering addresses the susceptibility of deep clustering models to find trivial solutions with non-meaningful representations. To address this issue, we present a new auxiliary objective that - when compared to the popular autoencoder-based approach - better aligns with the main clustering objective, resulting in improved clustering performance. Similarly, our work on multi-view clustering focuses on how representations can be learned from multi-view data, in order to make the representations suitable for the clustering objective. Where recent methods for deep multi-view clustering have focused on aligning view-specific representations, we find that this alignment procedure might actually be detrimental to representation quality. We investigate the effects of representation alignment, and provide novel insights on when alignment is beneficial, and when it is not. Based on our findings, we present several new methods for deep multi-view clustering - both alignment and non-alignment-based - that out-perform current state-of-the-art methods.
Our first work on few-shot learning aims to tackle the hubness problem, which has been shown to have negative effects on few-shot classification performance. To this end, we present two new methods to embed representations on the hypersphere for few-shot learning. Further, we provide both theoretical and experimental evidence indicating that embedding representations as uniformly as possible on the hypersphere reduces hubness and improves classification accuracy. Building on our findings on hyperspherical embeddings for few-shot learning, we also seek to improve the understanding of representation norms. In particular, we ask what type of information the norm carries, and why it is often beneficial to discard the norm in classification models. We answer this question by presenting a novel hypothesis on the relationship between representation norm and the number of a certain class of objects in the image. We then analyze our hypothesis both theoretically and experimentally, presenting promising results that corroborate the hypothesis.
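As a rough illustration of what embedding on the hypersphere and discarding the norm can look like in a few-shot classifier, the sketch below L2-normalizes embeddings and assigns a query to the class prototype with the highest cosine similarity. It is a generic nearest-prototype baseline, not the methods proposed in the thesis; the function names and the toy two-dimensional data are hypothetical.

```python
import math

def l2_normalize(vec):
    """Project a vector onto the unit hypersphere by discarding its norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

def class_prototypes(support, labels):
    """Sum the normalized support embeddings per class, then re-normalize."""
    sums = {}
    for vec, label in zip(support, labels):
        unit = l2_normalize(vec)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
    return {label: l2_normalize(acc) for label, acc in sums.items()}

def classify(query, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    q = l2_normalize(query)
    return max(
        prototypes,
        key=lambda label: sum(a * b for a, b in zip(q, prototypes[label])),
    )

# Toy support set (assumed data): class "a" points along the first axis,
# class "b" along the second. Norms differ widely but are ignored.
support = [[3.0, 0.1], [10.0, 1.0], [0.2, 2.0], [0.5, 8.0]]
labels = ["a", "a", "b", "b"]
prototypes = class_prototypes(support, labels)
prediction = classify([50.0, 20.0], prototypes)
```

Because every vector is normalized before comparison, only the direction of an embedding influences the prediction; the large norm of the query has no effect.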
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks. Comment: In The Twelfth ACM International Conference on Web Search and Data
Mining (WSDM '19)
Transductive Zero-Shot Action Recognition by Word-Vector Embedding
The number of categories for action recognition is growing rapidly and it has
become increasingly hard to label sufficient training data for learning
conventional models for all categories. Instead of collecting ever more data
and labelling them exhaustively for all categories, an attractive alternative
approach is "zero-shot learning" (ZSL). To that end, in this study we construct
a mapping between visual features and a semantic descriptor of each action
category, allowing new categories to be recognised in the absence of any visual
training data. Existing ZSL studies focus primarily on still images, and
attribute-based semantic representations. In this work, we explore word-vectors
as the shared semantic space to embed videos and category labels for ZSL action
recognition. This is a more challenging problem than existing ZSL of still
images and/or attributes, because the mapping between video spacetime features
of actions and the semantic space is more complex and harder to learn for the
purpose of generalising over any cross-category domain shift. To solve this
generalisation problem in ZSL action recognition, we investigate a series of
synergistic strategies to improve upon the standard ZSL pipeline. Most of these
strategies are transductive in nature, which means they require access to testing data in the
training phase. Comment: Accepted by IJC
Zero-shot learning via discriminative representation extraction
Zero-shot learning (ZSL) aims to recognize classes whose samples did not appear during training. Existing research focuses on mapping deep visual features to a semantic embedding space, explicitly or implicitly. However, ZSL improvements driven by discriminative feature transformation are not well studied. In this paper, we propose a ZSL framework that maps semantic embeddings to a discriminative representation space, which is learned in two different ways: Kernelized Linear Discriminant Analysis (KLDA) and Central-loss based Network (CLN). Both KLDA and CLN enforce intra-class aggregation and inter-class separation of samples. With the learned discriminative representations, we map class embeddings to the representation space using Kernelized Ridge Regression (KRR). Our experiments show that both KLDA+KRR and CLN+KRR surpass state-of-the-art approaches in both recognition and retrieval tasks.
Hubness Reduction Improves Sentence-BERT Semantic Spaces
Semantic representations of text, i.e. representations of natural language
which capture meaning by geometry, are essential for areas such as information
retrieval and document grouping. High-dimensional trained dense vectors have
received much attention in recent years as such representations. We investigate
the structure of semantic spaces that arise from embeddings made with
Sentence-BERT and find that the representations suffer from a well-known
problem in high dimensions called hubness. Hubness results in asymmetric
neighborhood relations, such that some texts (the hubs) are neighbours of many
other texts while most texts (so-called anti-hubs) are neighbours of few or no
other texts. We quantify the semantic quality of the embeddings using hubness
scores and error rate of a neighbourhood based classifier. We find that when
hubness is high, we can reduce error rate and hubness using hubness reduction
methods. We identify a combination of two methods as resulting in the best
reduction. For example, on one of the tested pretrained models, this combined
method can reduce hubness by about 75% and error rate by about 9%. Thus, we
argue that mitigating hubness in the embedding space provides better semantic
representations of text. Comment: Accepted at NLDL 202
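One common way to quantify hubness is the skewness of the k-occurrence distribution, that is, of how often each point appears among the k nearest neighbours of the other points. The sketch below computes it with the standard library only. It assumes Euclidean distance and a tiny toy point set, and is not necessarily the exact hubness score used in the paper.

```python
import math

def k_occurrence(points, k):
    """Count how often each point is among the k nearest neighbours of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Stable sort: ties keep the lower index first.
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts

def skewness(values):
    """Third standardized moment; large positive values indicate hubs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Toy data (assumed): a central point surrounded by four others.
points = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
counts = k_occurrence(points, k=1)
hubness = skewness(counts)
```

In this toy set the central point is the nearest neighbour of every other point, so it acts as a hub while three of the outer points are anti-hubs, and the skewness is positive.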
Subspace-based dynamic selection for high-dimensional data
The number of features collected has increased greatly in the past decade, particularly in medicine and life sciences, which brings challenges and opportunities. Making reliable predictions, exploring associations and extracting meaningful information in high-dimensional data are some of the problems that are yet to be solved. Due to intrinsic properties of high-dimensional spaces such as distance concentration and hubness, traditional classification and clustering algorithms face difficult challenges. In general, a Multiple Classifier System (MCS) provides better classification accuracy than individual classifiers. One of the most promising approaches to MCS is Dynamic Selection (DS) methods, which work by selecting classifiers on the fly, according to each unknown test sample. The rationale behind this is that not every classifier is an expert in predicting all samples; rather, each classifier or combination of classifiers is an expert in a different region of the feature space, whose quality can significantly impact the overall performance.
This thesis provides three major contributions. First, traditional DS methods fail to perform effectively on high-dimensional data sets due to the use of k-Nearest Neighbours (k-NN) to define the region of competence and, moreover, they do not indicate which features are the most important for classification. Second, two frameworks are proposed, the Subspace-Based Dynamic Selection (SBDS) and the Classifier SBDS (cSBDS), which integrate characteristics of DS methods and subspace clustering. Subspace clustering methods localise their search for clusters and are able to uncover clusters that exist in multiple, possibly overlapping subspaces of features and/or samples. The subspace clustering approach separates the high-dimensional feature space into small feature spaces with a reduced number of features and samples in each one. The results indicate that the cSBDS framework performs statistically better when compared to DS methods and majority voting on real-world and synthetic datasets. Third, we provide a comparison between the features selected by the cSBDS framework and feature importance methods. The results indicate that for high-dimensional datasets, the cSBDS framework is able to capture the most important features when the number of clusters per class is increased, while traditional feature importance methods lose this capability.
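To make the traditional DS setting concrete, the following minimal sketch selects, for each query, the base classifier with the highest accuracy on the k validation samples nearest to that query (the k-NN region of competence). It illustrates the generic k-NN-based baseline that the thesis criticises, not the SBDS or cSBDS frameworks; the classifiers, data and function names are hypothetical.

```python
import math

def region_of_competence(query, val_x, k):
    """Indices of the k validation samples closest to the query (Euclidean)."""
    order = sorted(range(len(val_x)), key=lambda i: math.dist(query, val_x[i]))
    return order[:k]

def select_classifier(query, classifiers, val_x, val_y, k=3):
    """Pick the classifier with the highest accuracy inside the region."""
    region = region_of_competence(query, val_x, k)

    def local_accuracy(clf):
        return sum(clf(val_x[i]) == val_y[i] for i in region) / len(region)

    return max(classifiers, key=local_accuracy)

# Two hypothetical base classifiers, each reliable in one half of the space.
def clf_a(x):
    return 1 if x[1] > 0 else 0  # correct where x[0] < 0

def clf_b(x):
    return 1 if x[1] < 0 else 0  # correct where x[0] > 0

# Toy validation set (assumed data) with an XOR-like labelling.
val_x = [[-2.0, 1.0], [-2.0, -1.0], [-1.0, 1.0], [-1.0, -1.0],
         [1.0, 1.0], [1.0, -1.0], [2.0, 1.0], [2.0, -1.0]]
val_y = [1, 0, 1, 0, 0, 1, 0, 1]

left_query, right_query = [-1.5, 0.5], [1.5, 0.5]
chosen_left = select_classifier(left_query, [clf_a, clf_b], val_x, val_y)
chosen_right = select_classifier(right_query, [clf_a, clf_b], val_x, val_y)
```

Since the region of competence is defined by distances over all features, its quality degrades when distances concentrate in high dimensions, which is the weakness the first contribution points to.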
Cross-View Learning
Key to achieving more efficient machine intelligence is the capability to analyse and understand
data across different views – which can be camera views or modality views (such as
visual and textual). One generic learning paradigm for automatically understanding data from different
views is called cross-view learning, which includes cross-view matching, cross-view fusion
and cross-view generation. Specifically, this thesis investigates two of them, cross-view matching
and cross-view generation, by developing new methods for addressing the following specific
computer vision problems.
The first problem is cross-view matching for person re-identification, in which a person is captured
by multiple non-overlapping camera views and the objective is to match him/her across views
among a large number of imposters. Typically a person’s appearance is represented using features
of thousands of dimensions, whilst only hundreds of training samples are available due
to the difficulties in collecting matched training samples. With the number of training samples
much smaller than the feature dimension, the existing methods thus face the classic small sample
size (SSS) problem and have to resort to dimensionality reduction techniques and/or matrix
regularisation, which lead to loss of discriminative power for cross-view matching. To that end,
this thesis proposes to overcome the SSS problem in subspace learning by matching cross-view
data in a discriminative null space of the training data.
The second problem is cross-view matching for zero-shot learning where data are drawn
from different modalities each for a different view (e.g. visual or textual), versus single-modal
data considered in the first problem. This is inherently more challenging as the gap between
different views becomes larger. Specifically, the zero-shot learning problem can be solved if
the visual representation/view of the data (object) and its textual view are matched. Moreover,
it requires learning a joint embedding space into which data from different views can be projected for
nearest neighbour search. This thesis argues that the key to making zero-shot learning models succeed
is to choose the right embedding space. Different from most existing zero-shot learning
models, which utilise a textual or an intermediate space as the embedding space for achieving
cross-view matching, the proposed method uniquely explores the visual space as the embedding space.
This thesis finds that in the visual space, the subsequent nearest neighbour search would suffer
much less from the hubness problem and thus become more effective. Moreover, a natural mechanism
for jointly optimising multiple textual modalities in an end-to-end manner in this model
demonstrates significant advantages over existing methods.
The last problem is cross-view generation for image captioning which aims to automatically
generate textual sentences from visual images. Most existing image captioning studies are limited
to investigating variants of deep learning-based image encoders, improving the inputs for the
subsequent deep sentence decoders. Existing methods have two limitations: (i) They are trained
to maximise the likelihood of each ground-truth word given the previous ground-truth words and
the image, termed Teacher-Forcing. This strategy may cause a mismatch between training and
testing since at test-time the model uses the previously generated words from the model distribution
to predict the next word. This exposure bias can result in error accumulation in sentence
generation during test time, since the model has never been exposed to its own predictions. (ii)
The training supervision metric, such as the widely used cross entropy loss, is different from
the evaluation metrics at test time. In other words, the model is not directly optimised towards
the task expectation. The learned model is therefore suboptimal. One main underlying reason
is that the evaluation metrics are non-differentiable and therefore much harder to
optimise against. This thesis overcomes the above problems by exploring
reinforcement learning. Specifically, a novel actor-critic based learning approach is formulated to directly
maximise the reward - the actual Natural Language Processing quality metrics of interest.
Compared to existing reinforcement learning based captioning models, the new method has the
unique advantage of enabling per-token advantage and value computation, leading to better
model training.
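As a hedged illustration of what a per-token advantage can mean in an actor-critic setting, the sketch below computes a return-to-go for every token position and subtracts a critic's value estimate at that position. This is the textbook form of the advantage, not the specific formulation of the thesis; the reward and value numbers are made up, with the sentence-level metric score assumed to arrive as a reward on the final token.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every token position t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def per_token_advantages(rewards, values, gamma=1.0):
    """Advantage A_t = G_t - V(s_t), one number per generated token."""
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

# Assumed toy episode: a four-token caption whose metric score (0.8)
# is given as a reward at the last token; values come from a critic.
rewards = [0.0, 0.0, 0.0, 0.8]
values = [0.5, 0.6, 0.9, 0.7]
advantages = per_token_advantages(rewards, values)
```

A positive advantage at a position means the sampled token led to a better outcome than the critic expected there, so the actor's log-probability for that token would be pushed up; a negative one pushes it down.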
- …