Discovering structure without labels
The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction.
The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space.
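As a rough illustration of the attraction/repulsion idea behind Mod Shift, the following minimal sketch moves each point toward nearby points and gently away from distant ones until crisp groups form. The function name, the hard distance threshold, and all parameter values are hypothetical simplifications; this is not the actual Mod Shift update or its Multicut-related loss.

```python
import numpy as np

def attract_repel_step(points, radius=1.0, attraction=1.0, repulsion=0.05, lr=0.05):
    """One illustrative update: points within `radius` of each other attract
    (spring-like pull), points further apart are gently repelled, so dense
    groups contract into crisp clusters while distinct groups stay separated."""
    diffs = points[None, :, :] - points[:, None, :]          # diffs[i, j] = points[j] - points[i]
    dists = np.linalg.norm(diffs, axis=-1, keepdims=True)    # pairwise distances
    np.fill_diagonal(dists[..., 0], np.inf)                  # ignore self-interactions
    pull = np.where(dists < radius, attraction * diffs, 0.0)           # toward nearby points
    push = np.where(dists >= radius, -repulsion * diffs / dists, 0.0)  # away from distant points
    return points + lr * (pull + push).sum(axis=1)

# toy usage: two noisy blobs contract onto two tight clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
for _ in range(100):
    X = attract_repel_step(X)
```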
The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success.
Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods that encompasses both UMAP- and t-SNE-like versions as special cases. Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global structure to more discrete, local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning, and we show that different flavors of contrastive losses can work for both with few noise samples.
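To make the attraction/repulsion trade-off concrete, here is a schematic per-pair negative-sampling loss with a Cauchy kernel and an explicit attraction weight. It mirrors the structure discussed above (attractive term for a neighbor pair, repulsive term for sampled non-neighbors), but the function name, kernel choice, and the way the attraction knob is introduced are illustrative assumptions, not the exact loss derived in the thesis.

```python
import numpy as np

def neg_sampling_pair_loss(d_pos, d_negs, attraction=1.0):
    """Schematic UMAP-style negative-sampling loss for one embedding pair.

    d_pos:      embedding distance between a point and one high-dimensional neighbor
    d_negs:     embedding distances to randomly sampled non-neighbors (noise samples)
    attraction: weight of the attractive term; increasing it yields more compact,
                UMAP-like embeddings, decreasing it more t-SNE-like ones.
    """
    q = lambda d: 1.0 / (1.0 + np.asarray(d) ** 2)        # Cauchy similarity in the embedding
    attractive = -attraction * np.log(q(d_pos))           # pulls the neighbor pair together
    repulsive = -np.log(1.0 - q(d_negs) + 1e-12).sum()    # pushes sampled non-neighbors apart
    return attractive + repulsive
```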
A Brief Introduction to Machine Learning for Engineers
This monograph aims at providing an introduction to key concepts, algorithms, and theoretical results in machine learning. The treatment concentrates on probabilistic models for supervised and unsupervised learning problems. It introduces fundamental concepts and algorithms by building on first principles, while also exposing the reader to more advanced topics with extensive pointers to the literature, within a unified notation and mathematical framework. The material is organized according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, as well as directed and undirected models. This monograph is meant as an entry point for researchers with a background in probability and linear algebra.
Regularization, Adaptation and Generalization of Neural Networks
The ability to generalize to unseen data is one of the fundamental, desired properties in a learning system. This thesis reports on different research efforts aimed at improving the generalization properties of machine learning systems at different levels, focusing on neural networks for computer vision tasks.
First, a novel regularization method is presented, Curriculum Dropout. It combines Curriculum Learning and Dropout, and shows better regularization effects than the original algorithm in a variety of tasks, with essentially no additional implementation effort.
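One way to picture the combination is a dropout rate that is annealed from zero towards its target value over training, so the network faces a gradually harder task. The exponential schedule and parameter names below are an illustrative guess at such a curriculum, not necessarily the exact formulation used in the thesis.

```python
import numpy as np

def curriculum_dropout_rate(step, target_rate=0.5, gamma=1e-4):
    """Illustrative curriculum schedule: drop (almost) no units at the start of
    training and anneal the dropout rate towards `target_rate` as `step` grows."""
    return target_rate * (1.0 - np.exp(-gamma * step))

# usage: pass the scheduled rate to an ordinary dropout layer at every training step
for step in (0, 1_000, 10_000, 100_000):
    print(step, round(curriculum_dropout_rate(step), 3))
```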
While regularization methods are extremely powerful for generalizing better to unseen data from the same distribution as the training one, they are not very successful in mitigating the dataset bias issue. This problem consists of models learning the peculiarities of the training set and generalizing poorly to unseen domains. Unsupervised domain adaptation has been one of the main solutions to this problem. Two novel adaptation approaches are presented in this thesis. First, we introduce the DIFA algorithm, which combines domain invariance and feature augmentation to better adapt models to new domains by relying on adversarial training. Next, we propose an original procedure that exploits the "mode collapse" behavior of Generative Adversarial Networks.
Finally, the general applicability of domain adaptation algorithms is questioned, due to the assumptions of knowing the target distribution a priori and being able to sample from it. A novel framework is presented to overcome these liabilities, where the goal is to generalize to unseen domains by relying only on data from a single source distribution. We face this problem through the lens of robust statistics, defining a worst-case formulation where the model parameters are optimized with respect to populations that lie within a fixed distance from the source domain in a semantic space.
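A rough sketch of what such a worst-case formulation can look like in practice is given below: inputs are perturbed by gradient ascent to increase the task loss while a penalty keeps them close to the source data in a feature space standing in for the semantic space. The `model.features` hook, the squared-distance penalty, and all hyperparameters are assumptions for illustration, not the exact procedure developed in the thesis.

```python
import torch
import torch.nn.functional as F

def worst_case_examples(model, x, y, distance_weight=1.0, steps=15, lr=0.1):
    """Illustrative inner maximization: perturb the inputs to increase the task
    loss, while a penalty keeps the perturbed points close to the originals in
    the model's feature (semantic) space. The perturbed batch stands in for a
    population some distance away from the source domain."""
    feats_src = model.features(x).detach()            # assumed feature-extractor hook returning (batch, dim)
    x_adv = x.clone().requires_grad_(True)
    for _ in range(steps):
        task_loss = F.cross_entropy(model(x_adv), y)
        semantic_dist = ((model.features(x_adv) - feats_src) ** 2).sum(dim=1).mean()
        objective = task_loss - distance_weight * semantic_dist
        grad, = torch.autograd.grad(objective, x_adv)
        x_adv = (x_adv + lr * grad).detach().requires_grad_(True)
    return x_adv.detach()
```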
A precise bare simulation approach to the minimization of some distances. Foundations
In information theory, as well as in the adjacent fields of statistics, machine learning, artificial intelligence, signal processing and pattern recognition, many flexibilizations of the omnipresent Kullback-Leibler information distance (relative entropy) and of the closely related Shannon entropy have become frequently used tools. The main goal of this paper is to tackle the corresponding constrained minimization (respectively, maximization) problems by a newly developed dimension-free bare (pure) simulation method. Almost no assumptions (such as convexity) on the set of constraints are needed within our discrete setup of arbitrary dimension, and our method is precise (i.e., it converges in the limit). As a side effect, we also derive an innovative way of constructing new useful distances/divergences. To illustrate the core of our approach, we present numerous examples. The potential for widespread applicability is indicated, too; in particular, we deliver many recent references for uses of the involved distances/divergences and entropies in various research fields, which may also serve as an interdisciplinary interface.
Generalised Bayesian matrix factorisation models
Factor analysis and related models for probabilistic matrix factorisation are of central importance to the unsupervised analysis of data, with a colourful history more than a century long. Probabilistic models for matrix factorisation allow us to explore the underlying structure in data, and have relevance in a vast number of application areas including collaborative filtering, source separation, missing data imputation, gene expression analysis, information retrieval, computational finance and computer vision, amongst others. This thesis develops generalisations of matrix factorisation models that advance our understanding and enhance the applicability of this important class of models.
The generalisation of models for matrix factorisation focuses on three concerns: widening the applicability of latent variable models to the diverse types of data that are currently available; considering alternative structural forms in the underlying representations that are inferred; and including higher order data structures into the matrix factorisation framework. These three issues reflect the reality of modern data analysis and we develop new models that allow for a principled exploration and use of data in these settings. We place emphasis on Bayesian approaches to learning and the advantages that come with the Bayesian methodology. Our port of departure is a generalisation of latent variable models to members of the exponential family of distributions. This generalisation allows for the analysis of data that may be real-valued, binary, counts, non-negative or a heterogeneous set of these data types. The model unifies various existing models and constructs for unsupervised settings, the complementary framework to the generalised linear models in regression.
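As a toy illustration of the exponential-family generalisation described above, the sketch below factorises a binary matrix by placing a low-rank structure on the natural parameters of a Bernoulli likelihood. It is a maximum-likelihood simplification of the idea (the thesis works with Bayesian inference), and the function name and hyperparameters are hypothetical.

```python
import numpy as np

def binary_matrix_factorisation(X, rank=5, steps=500, lr=0.05, seed=0):
    """Illustrative exponential-family matrix factorisation for binary data:
    the natural parameters are the low-rank product U @ V.T and each entry of X
    is modelled as Bernoulli(sigmoid(theta)). Other data types would swap in the
    appropriate exponential-family likelihood (Gaussian, Poisson, ...)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = 0.01 * rng.standard_normal((n, rank))
    V = 0.01 * rng.standard_normal((d, rank))
    for _ in range(steps):
        theta = U @ V.T
        resid = X - 1.0 / (1.0 + np.exp(-theta))   # gradient of the Bernoulli log-likelihood w.r.t. theta
        U += lr * resid @ V / d                     # gradient ascent on the loadings
        V += lr * resid.T @ U / n
    return U, V

# usage with a random binary matrix
X = (np.random.default_rng(1).random((200, 40)) < 0.3).astype(float)
U, V = binary_matrix_factorisation(X)
```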
Moving to structural considerations, we develop Bayesian methods for learning sparse latent representations. We define ideas of weakly and strongly sparse vectors and investigate the classes of prior distributions that give rise to these forms of sparsity, namely the scale-mixture of Gaussians and the spike-and-slab distribution. Based on these sparsity favouring priors, we develop and compare methods for sparse matrix factorisation and present the first comparison of these sparse learning approaches. As a second structural consideration, we develop models with the ability to generate correlated binary vectors. Moment-matching is used to allow binary data with specified correlation to be generated, based on dichotomisation of the Gaussian distribution. We then develop a novel and simple method for binary PCA based on Gaussian dichotomisation. The third generalisation considers the extension of matrix factorisation models to multi-dimensional arrays of data that are increasingly prevalent. We develop the first Bayesian model for non-negative tensor factorisation and explore the relationship between this model and the previously described models for matrix factorisation.
Supported by a Commonwealth Scholarship awarded by the Commonwealth Scholarship and Fellowship Programme (CSFP) [Award number ZACS-2207-363].
Supported by an award from the National Research Foundation, South Africa (NRF) [Award number SFH2007072200001].
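The correlated-binary construction mentioned in the abstract above can be illustrated by thresholding draws from a latent Gaussian: the latent covariance induces correlation among the resulting bits. The sketch below omits the moment-matching step that maps a target binary mean and correlation to the latent Gaussian parameters, so its inputs refer to the latent Gaussian itself; the function name is hypothetical.

```python
import numpy as np

def correlated_binary_samples(latent_mean, latent_cov, n_samples=1000, seed=0):
    """Illustrative dichotomised-Gaussian sampler: draw latent Gaussian vectors
    and threshold them at zero, so the latent covariance induces correlation in
    the resulting binary vectors."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(latent_mean, latent_cov, size=n_samples)
    return (z > 0).astype(int)

# usage: a positive latent correlation yields positively correlated bits
B = correlated_binary_samples([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]])
print(B.mean(axis=0), np.corrcoef(B.T)[0, 1])
```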
Statistical methods for the integrative analysis of single-cell multi-omics data
Single-cell profiling techniques have provided an unprecedented opportunity to study cellular heterogeneity at the molecular level. This represents a remarkable advance over traditional bulk sequencing methods, particularly to study lineage diversification and cell fate commitment events in heterogeneous biological processes. While the large majority of single-cell studies are focused on quantifying RNA expression, transcriptomic readouts provide only a single dimension of cellular heterogeneity. Recently, technological advances have enabled multiple biological layers to be probed in parallel one cell at a time, unveiling a powerful approach for investigating multiple dimensions of cellular heterogeneity. However, the increasing availability of multi-modal data sets needs to be accompanied by the development of suitable integrative strategies to fully exploit the data generated. In this thesis I worked in collaboration with different research groups to introduce innovative experimental and computational strategies for the integrative study of multi-omics at single-cell resolution.
The first contribution is the development of scNMT-seq, a protocol for the simultaneous profiling of RNA expression, DNA methylation and chromatin accessibility in single cells. I demonstrate how this assay provides a powerful approach for investigating regulatory relationships between the epigenome and the transcriptome within individual cells.
The second contribution is Multi-Omics Factor Analysis (MOFA), a statistical framework for the unsupervised integration of multi-omics data sets. MOFA is a Bayesian latent variable model that can be viewed as a statistically rigorous generalization of Principal Component Analysis to multi-omics data. The method provides a principled approach to retrieve, in an unsupervised manner, the underlying sources of sample heterogeneity while at the same time disentangling which axes of heterogeneity are shared across multiple modalities and which are specific to individual data modalities.
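To convey the structure MOFA assumes, the following sketch merely simulates multi-view data from a shared-factor model with view-specific loadings; it does not perform MOFA's Bayesian inference or sparsity regularisation, and the function and parameter names are hypothetical.

```python
import numpy as np

def simulate_multiomics(n_samples=100, n_factors=3, view_dims=(50, 30), seed=0):
    """Illustrative generative structure behind a MOFA-like model: a single
    matrix of latent factors Z is shared across views, and each view has its
    own loading matrix plus noise. Inference would recover Z and the loadings
    from the observed views; this sketch only simulates data from that structure."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, n_factors))              # factors shared across omics
    views = []
    for d in view_dims:
        W = rng.standard_normal((d, n_factors))                  # view-specific loadings
        views.append(Z @ W.T + 0.1 * rng.standard_normal((n_samples, d)))
    return Z, views
```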
The third contribution is the generation of a comprehensive molecular roadmap of mouse gastrulation at single-cell resolution. We employed scNMT-seq to simultaneously profile RNA expression, DNA methylation and chromatin accessibility for hundreds of cells, spanning multiple time points from the exit from pluripotency to primary germ layer specification. Using MOFA, and other tools, I performed an integrative analysis of the multi-modal measurements, revealing novel insights into the role of the epigenome in regulating this key developmental process.
The fourth contribution is an extended formulation of the MOFA model tailored to the analysis of large-scale single-cell data with complex experimental designs. I extended the model to incorporate a flexible regularisation that enables the joint analysis of multiple omics as well as multiple sample groups (batches and/or experimental conditions). In addition, I implemented a GPU-accelerated stochastic variational inference framework, thus enabling the scalable analysis of potentially millions of samples.
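The scalability argument rests on stochastic variational inference, where each minibatch gradient is rescaled as if it came from the full dataset. The toy sketch below shows that principle for a deliberately simple conjugate model (an unknown Gaussian mean); it is not the MOFA+ factor model and does not use a GPU, and all names and hyperparameters are assumptions for illustration.

```python
import numpy as np

def svi_gaussian_mean(data, batch_size=256, epochs=10, lr0=1.0, kappa=0.75,
                      prior_var=1.0, noise_var=1.0):
    """Toy stochastic variational inference for x_i ~ N(mu, noise_var) with a
    N(0, prior_var) prior on mu: minibatch statistics are scaled by N / |batch|
    so each update targets the full-data posterior."""
    N = len(data)
    nat1, nat2 = 0.0, 1.0 / prior_var                   # natural parameters of q(mu)
    step = 0
    for _ in range(epochs):
        for start in range(0, N, batch_size):
            batch = data[start:start + batch_size]
            scale = N / len(batch)                       # rescale the minibatch to the full data
            hat1 = scale * batch.sum() / noise_var       # batch estimate of the optimal parameters
            hat2 = 1.0 / prior_var + scale * len(batch) / noise_var
            rho = lr0 * (step + 1) ** (-kappa)           # decaying step size
            nat1 = (1 - rho) * nat1 + rho * hat1
            nat2 = (1 - rho) * nat2 + rho * hat2
            step += 1
    return nat1 / nat2, 1.0 / nat2                       # posterior mean and variance of mu

# usage: the recovered posterior mean approaches the sample mean for large N
data = np.random.default_rng(2).normal(1.5, 1.0, size=100_000)
print(svi_gaussian_mean(data))
```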