
    Improving Representation Learning for Deep Clustering and Few-shot Learning

    The amounts of data in the world have increased dramatically in recent years, and it is quickly becoming infeasible for humans to label all these data. It is therefore crucial that modern machine learning systems can operate with few or no labels. The introduction of deep learning and deep neural networks has led to impressive advancements in several areas of machine learning. These advancements are largely due to the unprecedented ability of deep neural networks to learn powerful representations from a wide range of complex input signals. This ability is especially important when labeled data is limited, as the absence of a strong supervisory signal forces models to rely more on intrinsic properties of the data and its representations. This thesis focuses on two key concepts in deep learning with few or no labels. First, we aim to improve representation quality in deep clustering - both for single-view and multi-view data. Current models for deep clustering face challenges related to properly representing semantic similarities, which is crucial for the models to discover meaningful clusterings. This is especially challenging with multi-view data, since the information required for successful clustering might be scattered across many views. Second, we focus on few-shot learning, and how geometrical properties of representations influence few-shot classification performance. We find that a large number of recent methods for few-shot learning embed representations on the hypersphere. Hence, we seek to understand what makes the hypersphere a particularly suitable embedding space for few-shot learning. Our work on single-view deep clustering addresses the susceptibility of deep clustering models to finding trivial solutions with non-meaningful representations.
To address this issue, we present a new auxiliary objective that - when compared to the popular autoencoder-based approach - better aligns with the main clustering objective, resulting in improved clustering performance. Similarly, our work on multi-view clustering focuses on how representations can be learned from multi-view data, in order to make the representations suitable for the clustering objective. Where recent methods for deep multi-view clustering have focused on aligning view-specific representations, we find that this alignment procedure might actually be detrimental to representation quality. We investigate the effects of representation alignment, and provide novel insights on when alignment is beneficial, and when it is not. Based on our findings, we present several new methods for deep multi-view clustering - both alignment-based and non-alignment-based - that outperform current state-of-the-art methods. Our first work on few-shot learning aims to tackle the hubness problem, which has been shown to have negative effects on few-shot classification performance. To this end, we present two new methods to embed representations on the hypersphere for few-shot learning. Further, we provide both theoretical and experimental evidence indicating that embedding representations as uniformly as possible on the hypersphere reduces hubness and improves classification accuracy. Finally, based on our findings on hyperspherical embeddings for few-shot learning, we seek to improve the understanding of representation norms. In particular, we ask what type of information the norm carries, and why it is often beneficial to discard the norm in classification models. We answer this question by presenting a novel hypothesis on the relationship between representation norm and the number of a certain class of objects in the image. We then analyze our hypothesis both theoretically and experimentally, presenting promising results that corroborate the hypothesis.

    Crosslingual Document Embedding as Reduced-Rank Ridge Regression

    There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-words features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks. Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19
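The central idea in the abstract above, a rank-constrained ridge regression whose weight matrix is factored into an embedding map, can be illustrated with a minimal single-language sketch. It uses one standard construction for reduced-rank ridge regression: the ridge solution is projected onto the leading right singular vectors of the fitted values of the ridge-augmented problem. The function name `reduced_rank_ridge`, the toy data, and all parameter values are assumptions made for illustration; the abstract does not specify the exact factorization or the multilingual setup used by Cr5.

```python
import numpy as np


def reduced_rank_ridge(X, Y, lam, rank):
    """Hypothetical helper: rank-constrained ridge regression via SVD.

    X: (n, d) bag-of-words features, Y: (n, c) one-hot concept labels.
    Returns an embedding map (d, rank) and a decoder (rank, c) whose
    product is the low-rank weight matrix.
    """
    d = X.shape[1]
    # Ordinary ridge solution W = (X^T X + lam I)^{-1} X^T Y.
    W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Fitted values of the augmented problem [X; sqrt(lam) I].
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    fitted = X_aug @ W_ridge
    # Leading right singular vectors span the best rank-r output subspace.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:rank].T            # (c, rank)
    embed = W_ridge @ V_r        # maps bag-of-words to a rank-dim embedding
    decode = V_r.T               # maps the embedding back to concept scores
    return embed, decode


# Toy data: 50 documents, 20 vocabulary terms, 5 concepts (all assumed).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 20)).astype(float)
Y = np.eye(5)[rng.integers(0, 5, size=50)]

embed, decode = reduced_rank_ridge(X, Y, lam=1.0, rank=3)
doc_vectors = X @ embed          # low-dimensional document embeddings
```

In a crosslingual setting one would presumably learn one such embedding map per language against shared concept labels, but that extension is not shown here.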

    Transductive Zero-Shot Action Recognition by Word-Vector Embedding

    The number of categories for action recognition is growing rapidly and it has become increasingly hard to label sufficient training data for learning conventional models for all categories. Instead of collecting ever more data and labelling them exhaustively for all categories, an attractive alternative approach is "zero-shot learning" (ZSL). To that end, in this study we construct a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. Existing ZSL studies focus primarily on still images and attribute-based semantic representations. In this work, we explore word-vectors as the shared semantic space to embed videos and category labels for ZSL action recognition. This is a more challenging problem than existing ZSL of still images and/or attributes, because the mapping between video spacetime features of actions and the semantic space is more complex and harder to learn for the purpose of generalising over any cross-category domain shift. To solve this generalisation problem in ZSL action recognition, we investigate a series of synergistic strategies to improve upon the standard ZSL pipeline. Most of these strategies are transductive in nature, which means that they require access to testing data in the training phase. Comment: Accepted by IJC

    Zero-shot learning via discriminative representation extraction

    Zero-shot learning (ZSL) aims to recognize classes whose samples did not appear during training. Existing research focuses on mapping deep visual features to a semantic embedding space, explicitly or implicitly. However, ZSL improvements driven by discriminative feature transformation are not well studied. In this paper, we propose a ZSL framework that maps semantic embeddings to a discriminative representation space, which is learned in two different ways: Kernelized Linear Discriminant Analysis (KLDA) and Central-loss based Network (CLN). KLDA and CLN both enforce intra-class aggregation and inter-class separation of samples. With the learned discriminative representations, we map class embeddings to the representation space using Kernelized Ridge Regression (KRR). Our experiments show that both KLDA+KRR and CLN+KRR surpass state-of-the-art approaches in both recognition and retrieval tasks.
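The KRR step described above (mapping class embeddings into a representation space, then recognising samples against the projected classes) can be sketched as follows. This is a minimal illustration that assumes an RBF kernel, the closed-form kernel ridge solution, and nearest-prototype classification; the kernel choice, the helper names, the toy dimensions, and the use of class prototypes are all assumptions, since the abstract does not give these details.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_krr(S_seen, P_seen, gamma, lam):
    """Closed-form kernel ridge regression: alpha = (K + lam I)^{-1} P."""
    K = rbf_kernel(S_seen, S_seen, gamma)
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), P_seen)


def project(S_new, S_seen, alpha, gamma):
    """Map class embeddings into the representation space."""
    return rbf_kernel(S_new, S_seen, gamma) @ alpha


def nearest_prototype(Z, prototypes):
    """Assign each row of Z to the closest prototype (squared Euclidean)."""
    d = ((Z[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


# Toy setup (assumed): 6 seen and 3 unseen classes, 5-dim semantic
# embeddings, 8-dim discriminative representations.
rng = np.random.default_rng(0)
S_seen = rng.normal(size=(6, 5))      # semantic embeddings of seen classes
P_seen = rng.normal(size=(6, 8))      # their prototypes in representation space
S_unseen = rng.normal(size=(3, 5))    # semantic embeddings of unseen classes

alpha = fit_krr(S_seen, P_seen, gamma=0.1, lam=1e-6)
P_unseen = project(S_unseen, S_seen, alpha, gamma=0.1)

# Test samples placed close to the projected unseen prototypes.
Z_test = P_unseen[[2, 0, 1]] + 1e-3 * rng.normal(size=(3, 8))
pred = nearest_prototype(Z_test, P_unseen)
```

The regression is fitted on seen classes only; unseen classes enter solely through their semantic embeddings, which is what makes the setting zero-shot.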

    Hubness Reduction Improves Sentence-BERT Semantic Spaces

    Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts, while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text. Comment: Accepted at NLDL 202
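The asymmetric neighbourhood relations described above can be quantified with the k-occurrence count: how often each point appears in the k-nearest-neighbour lists of the other points. A common hubness score is the skewness of these counts. The sketch below assumes cosine similarity and this skewness score; the abstract does not say which hubness scores the authors use, and the random vectors merely stand in for real sentence embeddings.

```python
import numpy as np


def k_occurrence(X, k):
    """Count how often each row of X appears among the k nearest
    neighbours (by cosine similarity) of the other rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)            # a point is not its own neighbour
    nn = np.argsort(-sim, axis=1)[:, :k]      # indices of the k most similar rows
    return np.bincount(nn.ravel(), minlength=X.shape[0])


def skewness(counts):
    """Sample skewness of the k-occurrence distribution; larger positive
    values indicate a few hubs and many anti-hubs."""
    c = counts.astype(float)
    centered = c - c.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# Stand-in for sentence embeddings (assumed sizes): 200 vectors, 300 dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))

counts = k_occurrence(X, k=10)
hub_score = skewness(counts)
anti_hubs = int((counts == 0).sum())          # points that are nobody's neighbour
```

Comparing `hub_score` before and after a hubness reduction step is one way to check whether the step had the intended effect.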

    Subspace-based dynamic selection for high-dimensional data

    The number of features collected has increased greatly in the past decade, particularly in medicine and life sciences, which brings challenges and opportunities. Making reliable predictions, exploring associations and extracting meaningful information in high-dimensional data are some of the problems that are yet to be solved. Due to intrinsic properties of high-dimensional spaces such as distance concentration and hubness, traditional classification and clustering algorithms face difficult challenges. In general, a Multiple Classifier System (MCS) provides better classification accuracy than individual classifiers. One of the most promising approaches to MCS is Dynamic Selection (DS) methods, which work by selecting classifiers on the fly, according to each unknown test sample. The rationale behind this is that not every classifier is an expert in predicting all samples; rather, each classifier or combination of classifiers is an expert in a different region of the feature space, and the quality of this region can significantly impact the overall performance. This thesis provides three major contributions. First, traditional DS methods fail to perform effectively in high-dimensional data sets due to the use of k-Nearest Neighbours (k-NN) to define the region of competence and, moreover, they do not indicate which are the most important features for classification. Second, two frameworks are proposed, the Subspace-Based Dynamic Selection (SBDS) and the Classifier SBDS (cSBDS), which integrate characteristics of DS methods and subspace clustering. Subspace clustering methods localise their search for clusters and are able to uncover clusters that exist in multiple, possibly overlapping subspaces of features and/or samples. The subspace clustering approach separates the high-dimensional feature space into small feature spaces with a reduced number of features and samples in each one.
The results indicate that the cSBDS framework performs statistically better when compared to DS methods and majority voting on real-world and synthetic datasets. Third, we provide a comparison between the features selected by the cSBDS framework and feature importance methods. The results indicate that for high-dimensional datasets, the cSBDS framework is able to capture the most important features when the number of clusters per class is increased, while traditional feature importance methods lose this capability.
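The traditional Dynamic Selection idea that this abstract builds on, choosing a classifier per test sample according to its accuracy in a k-NN region of competence, can be sketched as below. This shows only the k-NN baseline, not the SBDS or cSBDS frameworks, which replace the k-NN region with subspace clustering. The function name, the two toy classifiers, and the data are assumptions made for illustration.

```python
import numpy as np


def select_by_local_accuracy(x, X_val, y_val, val_preds, k):
    """Pick the classifier with the highest accuracy on the k validation
    samples nearest to x (the region of competence).

    val_preds: (n_classifiers, n_val) predictions on the validation set.
    """
    dists = np.linalg.norm(X_val - x, axis=1)
    region = np.argsort(dists)[:k]
    local_acc = (val_preds[:, region] == y_val[region]).mean(axis=1)
    return int(np.argmax(local_acc))


def clf_a(X):
    """Toy classifier: predicts 1 when the first feature is positive."""
    return (X[:, 0] > 0).astype(int)


def clf_b(X):
    """Toy classifier: the opposite rule to clf_a."""
    return (X[:, 0] <= 0).astype(int)


# Toy validation set: clf_a is correct where the second feature is
# negative, clf_b is correct elsewhere.
rng = np.random.default_rng(0)
X_val = rng.uniform(-3, 3, size=(200, 2))
y_val = np.where(X_val[:, 1] < 0, clf_a(X_val), clf_b(X_val))
val_preds = np.stack([clf_a(X_val), clf_b(X_val)])

chosen_low = select_by_local_accuracy(np.array([0.5, -2.0]), X_val, y_val, val_preds, k=7)
chosen_high = select_by_local_accuracy(np.array([0.5, 2.0]), X_val, y_val, val_preds, k=7)
```

Because the region of competence depends on Euclidean distances over all features, it degrades under distance concentration, which is the weakness in high dimensions that the abstract points to.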

    Cross-View Learning

    Key to achieving more efficient machine intelligence is the capability to analyse and understand data across different views, which can be camera views or modality views (such as visual and textual). One generic learning paradigm for automated understanding of data from different views is called cross-view learning, which includes cross-view matching, cross-view fusion and cross-view generation. Specifically, this thesis investigates two of them, cross-view matching and cross-view generation, by developing new methods for addressing the following specific computer vision problems. The first problem is cross-view matching for person re-identification, in which a person is captured by multiple non-overlapping camera views and the objective is to match him/her across views among a large number of imposters. Typically a person’s appearance is represented using features of thousands of dimensions, whilst only hundreds of training samples are available due to the difficulties in collecting matched training samples. With the number of training samples much smaller than the feature dimension, the existing methods thus face the classic small sample size (SSS) problem and have to resort to dimensionality reduction techniques and/or matrix regularisation, which lead to loss of discriminative power for cross-view matching. To that end, this thesis proposes to overcome the SSS problem in subspace learning by matching cross-view data in a discriminative null space of the training data. The second problem is cross-view matching for zero-shot learning, where data are drawn from different modalities, each for a different view (e.g. visual or textual), in contrast to the single-modal data considered in the first problem. This is inherently more challenging as the gap between different views becomes larger. Specifically, the zero-shot learning problem can be solved if the visual representation/view of the data (object) and its textual view are matched.
Moreover, it requires learning a joint embedding space into which data from different views can be projected for nearest neighbour search. This thesis argues that the key to making zero-shot learning models succeed is to choose the right embedding space. Unlike most existing zero-shot learning models, which utilise a textual or an intermediate space as the embedding space for achieving cross-view matching, the proposed method uniquely explores the visual space as the embedding space. This thesis finds that in the visual space, the subsequent nearest neighbour search would suffer much less from the hubness problem and thus become more effective. Moreover, the model provides a natural mechanism for jointly optimising multiple textual modalities in an end-to-end manner, which demonstrates significant advantages over existing methods. The last problem is cross-view generation for image captioning, which aims to automatically generate textual sentences from visual images. Most existing image captioning studies are limited to investigating variants of deep learning-based image encoders, improving the inputs for the subsequent deep sentence decoders. Existing methods have two limitations: (i) They are trained to maximise the likelihood of each ground-truth word given the previous ground-truth words and the image, termed Teacher-Forcing. This strategy may cause a mismatch between training and testing, since at test time the model uses the previously generated words from the model distribution to predict the next word. This exposure bias can result in error accumulation in sentence generation during test time, since the model has never been exposed to its own predictions. (ii) The training supervision metric, such as the widely used cross entropy loss, is different from the evaluation metrics at test time. In other words, the model is not directly optimised towards the task expectation. The learned model is therefore suboptimal.
One main underlying reason is that the evaluation metrics are non-differentiable and therefore much harder to optimise against. This thesis overcomes the above problems by exploring the idea of reinforcement learning. Specifically, a novel actor-critic based learning approach is formulated to directly maximise the reward, namely the actual Natural Language Processing quality metrics of interest. Compared to existing reinforcement learning based captioning models, the new method has the unique advantage of enabling per-token advantage and value computation, leading to better model training.