22 research outputs found

    Partially Exchangeable Networks and Architectures for Learning Summary Statistics in Approximate Bayesian Computation

    We present a novel family of deep neural architectures, named partially exchangeable networks (PENs), that leverage probabilistic symmetries. By design, PENs are invariant to block-switch transformations, which characterize the partial exchangeability properties of conditionally Markovian processes. Moreover, we show that any block-switch invariant function has a PEN-like representation. The DeepSets architecture is a special case of PEN, so we can also target fully exchangeable data. We employ PENs to learn summary statistics in approximate Bayesian computation (ABC). When comparing PENs to previous deep learning methods for learning summary statistics, our results are highly competitive for both time series and static models. Indeed, PENs provide more reliable posterior samples even when using less training data. Comment: Forthcoming in the Proceedings of ICML 2019. New comparisons with several different networks. We now use the Wasserstein distance to produce comparisons. Code available on GitHub. 16 pages, 5 figures, 21 tables
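    The invariance described in this abstract can be illustrated with a minimal sketch. It assumes a PEN of order d has the form rho(x[:d], sum of phi over all windows of d+1 consecutive values), so that d = 0 reduces to DeepSets-style sum pooling. The maps `phi` and `rho` below are hypothetical stand-ins for the learned networks, not the authors' code.

```python
def phi(block):
    """Hypothetical inner map applied to each window of d+1 consecutive values."""
    return (sum(block), sum(v * v for v in block), block[0] * block[-1])


def rho(head, pooled):
    """Hypothetical outer map combining the first d values with the pooled features."""
    return 7 * sum(head) + pooled[0] + 2 * pooled[1] + 3 * pooled[2]


def pen(x, d):
    """Order-d PEN-style function: sum-pool phi over sliding windows, then apply rho."""
    windows = [x[i:i + d + 1] for i in range(len(x) - d)]
    feats = [phi(w) for w in windows]
    pooled = tuple(sum(col) for col in zip(*feats))
    return rho(x[:d], pooled)


# d = 0: plain sum pooling (DeepSets-like), invariant to any reordering.
full_a = pen([3, 1, 2], 0)
full_b = pen([2, 3, 1], 0)

# d = 1: swapping the blocks [1, 0, 1] and [1, 1, 1], which share their first
# and last symbols, keeps the first value and all transition counts unchanged.
x = [0, 1, 0, 1, 0, 1, 1, 1, 0]
y = [0, 1, 1, 1, 0, 1, 0, 1, 0]
markov_a = pen(x, 1)
markov_b = pen(y, 1)
```

    Because the output depends only on the first d values and the multiset of windows, any block switch of this kind leaves it unchanged; integer-valued features are used here only so that the equality is exact.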

    Avoiding Kernel Fixed Points: Computing with ELU and GELU Infinite Networks

    Analysing and computing with Gaussian processes arising from infinitely wide neural networks has recently seen a resurgence in popularity. Despite this, many explicit covariance functions of networks with activation functions used in modern networks remain unknown. Furthermore, while the kernels of deep networks can be computed iteratively, theoretical understanding of deep kernels is lacking, particularly with respect to fixed-point dynamics. Firstly, we derive the covariance functions of MLPs with exponential linear units and Gaussian error linear units and evaluate the performance of the limiting Gaussian processes on some benchmarks. Secondly, and more generally, we introduce a framework for analysing the fixed-point dynamics of iterated kernels corresponding to a broad range of activation functions. We find that, unlike some previously studied neural network kernels, these new kernels exhibit non-trivial fixed-point dynamics which are mirrored in finite-width neural networks. Comment: 18 pages, 9 figures, 2 tables. Corrected name particle capitalisation and formatting
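    To make the idea of iterating a kernel through depth concrete, here is a small sketch. It does not use the ELU or GELU kernels derived in the paper; it swaps in the well-known ReLU (arc-cosine) kernel, assuming no biases and a weight scaling that preserves input norms, so each layer acts as a map on the correlation between two inputs. The function names are illustrative.

```python
import math


def relu_correlation_map(rho):
    """One layer of the normalized ReLU (arc-cosine) kernel acting on a correlation."""
    rho = max(-1.0, min(1.0, rho))  # guard acos against rounding error
    theta = math.acos(rho)
    return (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / math.pi


def iterate_kernel(rho0, depth):
    """Return the correlations after 0, 1, ..., depth layers."""
    path = [rho0]
    for _ in range(depth):
        path.append(relu_correlation_map(path[-1]))
    return path


# Start from orthogonal inputs and follow the correlation through 200 layers.
path = iterate_kernel(0.0, 200)
```

    Under these assumptions the correlation rises monotonically towards 1, so all inputs look alike at large depth. This is the kind of degenerate fixed point that the abstract contrasts with the non-trivial dynamics of the new kernels.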

    Probabilistic symmetries and invariant neural networks

    Treating neural network inputs and outputs as random variables, we characterize the structure of neural networks that can be used to model data that are invariant or equivariant under the action of a compact group. Much recent research has been devoted to encoding invariance under symmetry transformations into neural network architectures, in an effort to improve the performance of deep neural networks in data-scarce, non-i.i.d., or unsupervised settings. By considering group invariance from the perspective of probabilistic symmetry, we establish a link between functional and probabilistic symmetry, and obtain generative functional representations of probability distributions that are invariant or equivariant under the action of a compact group. Our representations completely characterize the structure of neural networks that can be used to model such distributions and yield a general program for constructing invariant stochastic or deterministic neural networks. We demonstrate that examples from the recent literature are special cases, and develop the details of the general program for exchangeable sequences and arrays. Comment: Revised structure for clarity; fixed minor mistakes; incorporated reviewer feedback for publication

    Set Representation Learning: A Framework for Learning Gigapixel Images

    In Machine Learning, we often encounter data as sets of instances, such as Point Clouds (sets of x, y, and z coordinates) or patches from gigapixel images (Digital Pathology, Satellite Imagery, Astronomical Images, etc.), and in settings such as Weakly Supervised Learning and Multiple Instance Learning. It is therefore convenient to have Machine Learning or AI algorithms that can learn set representations. However, most of the progress made in the last two decades has been limited to single-instance algorithms and smaller image datasets such as MNIST, CIFAR10, and CIFAR100. In this work, I present novel algorithms for Set Representation Learning. The contribution of this work is two-fold: 1. This work introduces three novel methods for learning Set Representations: a Memory-based Exchangeable Model (MEM), a Graph Neural Network based Set Representation Learning method, and a Hierarchical Set Representation Learning method. 2. This work demonstrates that learning gigapixel images can be formulated as a set representation problem and provides a framework for efficiently learning gigapixel image representations. Different themes are explored for Set Representation Learning. This work investigates Permutation Invariant Representations for Set Learning and introduces a new Permutation Invariant method, ‘MEM’. The Memory-based Exchangeable Model (MEM) uses a Permutation Invariant architecture and memory networks to learn interdependencies and relations between different elements of the set. Subsequently, Graph Neural Networks (GNNs) are studied for Set Representation Learning, and a new GNN based Set Representation Learning method is proposed. Motivated by the learning of interdependencies among different elements in MEM, the proposed method learns an equivalent graphical representation to model interactions and interdependencies among different elements of the set. Lastly, this work introduces a new learning scheme for learning Hierarchical Set Representations.
    To demonstrate the efficacy of the proposed algorithms, they are validated and benchmarked on a variety of synthetic and real-world datasets such as MNIST, Point Clouds, and Gaussian Distributions. Histopathology Images are used to demonstrate the application of Set Representation Learning to gigapixel images. State-of-the-art results on all datasets are achieved, thus demonstrating efficacy

    SPECTRE : Spectral Conditioning Helps to Overcome the Expressivity Limits of One-shot Graph Generators

    We approach the graph generation problem from a spectral perspective by first generating the dominant parts of the graph Laplacian spectrum and then building a graph matching these eigenvalues and eigenvectors. Spectral conditioning allows for direct modeling of the global and local graph structure and helps to overcome the expressivity and mode collapse issues of one-shot graph generators. Our novel GAN, called SPECTRE, enables the one-shot generation of much larger graphs than previously possible with one-shot models. SPECTRE outperforms state-of-the-art deep autoregressive generators in terms of modeling fidelity, while also avoiding expensive sequential generation and dependence on node ordering. As a case in point, on sizable synthetic and real-world graphs SPECTRE achieves a 4-to-170 fold improvement over the best competitor that does not overfit and is 23-to-30 times faster than autoregressive generators. Comment: 20 pages, 10 figures

    Uncertainty Estimation, Explanation and Reduction with Insufficient Data

    Human beings have long had to make smart decisions under uncertainty, trading off swift action against collecting sufficient evidence. A generalized artificial intelligence (GAI) is naturally expected to navigate uncertainties while still predicting precisely. In this thesis, we aim to propose strategies that underpin machine learning with uncertainties from three perspectives: uncertainty estimation, explanation and reduction. Estimation quantifies the variability in the model inputs and outputs. It enables us to evaluate the model's predictive confidence. Explanation provides a tool to interpret the mechanism of uncertainties and to pinpoint the potential for uncertainty reduction, which focuses on stabilizing model training, especially when the data is insufficient. We hope that this thesis can motivate related studies on quantifying predictive uncertainties in deep learning. It also aims to raise awareness among other stakeholders in the fields of smart transportation and automated medical diagnosis, where data insufficiency induces high uncertainty. The thesis is organized into the following sections: Introduction. We justify the necessity of investigating AI uncertainties and clarify the challenges existing in the latest studies, followed by our research objective. Literature review. We break down the review of the state-of-the-art methods into uncertainty estimation, explanation and reduction. We also make comparisons with related fields, including meta learning, anomaly detection, and continual learning. Uncertainty estimation. We introduce a variational framework, the neural process, which approximates Gaussian processes to handle uncertainty estimation. Two variants from the neural process family are proposed to enhance neural processes with scalability and continual learning. Uncertainty explanation.
    We inspect the functional distribution of neural processes to discover the global and local factors that affect the degree of predictive uncertainty. Uncertainty reduction. We validate the proposed uncertainty framework in two scenarios: urban irregular behaviour detection and neurological disorder diagnosis, where the intrinsic data insufficiency undermines the performance of existing deep learning models. Conclusion. We provide promising directions for future work and conclude the thesis

    Probabilistic Self-supervised Learning via Scoring Rules Minimization

    In this paper, we propose a novel probabilistic self-supervised learning method via Scoring Rule Minimization (ProSMIN), which leverages the power of probabilistic models to enhance representation quality and mitigate collapsing representations. Our proposed approach involves two neural networks, the online network and the target network, which collaborate and learn the diverse distribution of representations from each other through knowledge distillation. By presenting the input samples in two augmented formats, the online network is trained to predict the target network representation of the same sample under a different augmented view. The two networks are trained via our new loss function based on proper scoring rules. We provide a theoretical justification for ProSMIN's convergence, demonstrating the strict propriety of its modified scoring rule. This insight validates the method's optimization process and contributes to its robustness and effectiveness in improving representation quality. We evaluate our probabilistic model on various downstream tasks, such as in-distribution generalization, out-of-distribution detection, dataset corruption, low-shot learning, and transfer learning. Our method achieves superior accuracy and calibration, surpassing the self-supervised baseline in a wide range of experiments on large-scale datasets such as ImageNet-O and ImageNet-C. ProSMIN thereby demonstrates its scalability and real-world applicability
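    The abstract does not state which scoring rule ProSMIN modifies, so the following is only a generic illustration of what a loss based on a proper scoring rule looks like. It uses the energy score, a standard strictly proper rule, estimated from samples of a predictive distribution. The function name and the toy samples are assumptions made for this sketch.

```python
import math


def energy_score(samples, y):
    """Sample-based energy score of a predictive distribution at target y (lower is better).

    Estimates E||X - y|| - 0.5 * E||X - X'|| from a list of sample vectors.
    """
    m = len(samples)
    accuracy = sum(math.dist(x, y) for x in samples) / m
    spread = sum(math.dist(a, b) for a in samples for b in samples) / (2 * m * m)
    return accuracy - spread


target = (1.0,)
near = [(0.9,), (1.0,), (1.1,)]  # samples concentrated around the target
far = [(2.9,), (3.0,), (3.1,)]   # same spread, but centred away from the target

score_near = energy_score(near, target)
score_far = energy_score(far, target)
```

    The first term rewards samples that land close to the target, while the second rewards spread among the samples, which is one way such a loss can discourage the collapsed representations the abstract mentions.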

    Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness

    A recent empirical observation of activation sparsity in MLP layers offers an opportunity to drastically reduce computation costs for free. Despite several works attributing it to training dynamics, the theoretical explanation of activation sparsity's emergence is restricted to shallow networks, small training steps, and modified training, even though the sparsity has been found in deep models trained by vanilla protocols for large steps. To fill these three gaps, we propose the notion of gradient sparsity as the source of activation sparsity, together with a theoretical explanation that treats gradient sparsity, and then activation sparsity, as necessary steps towards adversarial robustness w.r.t. hidden features and parameters, which is approximately the flatness of minima for well-learned models. The theory applies to standardly trained LayerNorm-ed pure MLPs, and further to Transformers or other architectures if noise is added to the weights during training. To eliminate other sources of flatness when arguing the necessity of sparsity, we discover the phenomenon of spectral concentration, i.e., the ratio between the largest and the smallest non-zero singular values of the weight matrices is small. We utilize random matrix theory (RMT) as a powerful theoretical tool to analyze stochastic gradient noise and discuss the emergence of spectral concentration. With these insights, we propose two plug-and-play modules for both training from scratch and sparsity finetuning, as well as one radical modification that only applies to from-scratch training. Another module for both sparsity and flatness, still under testing, also follows immediately from our theory. Validation experiments are conducted to verify our explanation. Productivity experiments demonstrate that the modifications improve sparsity, indicating further theoretical cost reduction in both training and inference