
    Representation and statistical properties of deep neural networks on structured data

    The significant success of deep learning has brought unprecedented challenges to conventional wisdom in statistics, optimization, and applied mathematics. In many high-dimensional applications, e.g., image data with hundreds of thousands of pixels, deep learning is remarkably scalable and generalizes mysteriously well. Although such appealing behavior stimulates wide applications, a fundamental theoretical challenge, the curse of data dimensionality, naturally arises. Roughly put, the sample complexity in practical applications is significantly smaller than that predicted by theory. It is a common belief that deep neural networks are good at learning various geometric structures hidden in data sets, yet little theory has been established to explain this power. This thesis aims to bridge the gap between theory and practice by studying function approximation and statistical theories of deep neural networks that exploit geometric structures in data.

    Function Approximation Theories on Low-dimensional Manifolds using Deep Neural Networks. We first develop an efficient universal approximation theory for functions on a low-dimensional Riemannian manifold. A feedforward network architecture is constructed for function approximation, where the size of the network grows with the manifold dimension. Furthermore, we prove an efficient approximation theory for convolutional residual networks approximating Besov functions. Lastly, we demonstrate the benefit of overparameterized neural networks in function approximation: large neural networks are capable of accurately approximating a target function while the network itself enjoys Lipschitz continuity.

    Statistical Theories on Low-dimensional Data using Deep Neural Networks. Efficient approximation theories of neural networks provide valuable guidelines for choosing network architectures when data exhibit geometric structures. In combination with statistical tools, we prove that neural networks can circumvent the curse of data dimensionality and enjoy fast statistical convergence in various learning problems, including nonparametric regression/classification, generative distribution estimation, and doubly-robust policy learning.
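    The central setting in this abstract is data lying on a low-dimensional manifold inside a high-dimensional ambient space. As a rough, self-contained illustration of that setting only (not the thesis' construction or its theorems), the sketch below fits a small two-layer ReLU network by plain gradient descent in NumPy to a target that depends only on the intrinsic coordinate of a circle embedded in R^50.

```python
import numpy as np

# Illustrative setup: points on a 1-D manifold (a circle) embedded in a
# 50-dimensional ambient space, with a regression target that depends only
# on the intrinsic coordinate. A small two-layer ReLU network is trained by
# full-batch gradient descent on the mean-squared error.
rng = np.random.default_rng(0)

D, n = 50, 2000                              # ambient dimension, sample size
theta = rng.uniform(0, 2 * np.pi, n)         # intrinsic coordinate on the circle
A = rng.normal(size=(D, 2)) / np.sqrt(2)     # random linear embedding R^2 -> R^D
X = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ A.T
y = np.sin(3 * theta)                        # target depends only on theta

width, lr = 64, 0.1                          # modest width; intrinsic dimension is 1
W1 = rng.normal(size=(D, width)) / np.sqrt(D)
b1 = np.zeros(width)
w2 = rng.normal(size=width) / np.sqrt(width)
b2 = 0.0

for step in range(3000):
    H = np.maximum(X @ W1 + b1, 0.0)         # hidden activations
    err = H @ w2 + b2 - y
    g_H = np.outer(err, w2) * (H > 0)        # per-sample gradient w.r.t. H (no 1/n yet)
    W1 -= lr * X.T @ g_H / n
    b1 -= lr * g_H.mean(axis=0)
    w2 -= lr * H.T @ err / n
    b2 -= lr * err.mean()

H = np.maximum(X @ W1 + b1, 0.0)
print("final training MSE:", float(np.mean((H @ w2 + b2 - y) ** 2)))
```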

    A super-polynomial lower bound for learning nonparametric mixtures

    We study the problem of learning nonparametric distributions in a finite mixture, and establish a super-polynomial lower bound on the sample complexity of learning the component distributions in such models. Namely, we are given i.i.d. samples from $f$, where $f=\sum_{i=1}^k w_i f_i$, $\sum_{i=1}^k w_i=1$, $w_i>0$, and we are interested in learning each component $f_i$. Without any assumptions on $f_i$, this problem is ill-posed. In order to identify the components $f_i$, we assume that each $f_i$ can be written as a convolution of a Gaussian and a compactly supported density $\nu_i$ with $\text{supp}(\nu_i)\cap\text{supp}(\nu_j)=\emptyset$ for $i\neq j$. Our main result shows that $\Omega\big((\frac{1}{\varepsilon})^{C\log\log\frac{1}{\varepsilon}}\big)$ samples are required for estimating each $f_i$. The proof relies on a fast rate for approximation with Gaussians, which may be of independent interest. This result has important implications for the hardness of learning more general nonparametric latent variable models that arise in machine learning applications.
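    To make the model concrete, the sketch below simulates i.i.d. draws from such a mixture with a toy choice of the $\nu_i$ (uniform densities on disjoint intervals, my own illustrative choice, not taken from the paper): a component is selected with probability $w_i$, a point is drawn from $\nu_i$, and Gaussian noise is added.

```python
import numpy as np

# Simulate the mixture f = sum_i w_i (Gaussian * nu_i), where each nu_i is a
# compactly supported density and the supports are pairwise disjoint.
rng = np.random.default_rng(0)

weights = np.array([0.5, 0.3, 0.2])                 # w_i > 0, summing to 1
supports = [(-3.0, -2.0), (0.0, 1.0), (4.0, 5.0)]   # disjoint supp(nu_i)
sigma = 0.4                                         # Gaussian smoothing scale

def sample_f(n):
    """Draw n i.i.d. samples from f, returning the samples and component labels."""
    comps = rng.choice(len(weights), size=n, p=weights)
    lo = np.array([supports[c][0] for c in comps])
    hi = np.array([supports[c][1] for c in comps])
    u = rng.uniform(lo, hi)                         # draw from nu_i (uniform here)
    return u + sigma * rng.normal(size=n), comps

x, comps = sample_f(10_000)
print("sample mean:", x.mean(), "component counts:", np.bincount(comps))
```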

    Methods for Optimization and Regularization of Generative Models

    This thesis studies the problem of regularizing and optimizing generative models, often using insights and techniques from kernel methods. The work proceeds along three main themes.

    Conditional score estimation. We propose a method for estimating conditional densities based on a rich class of RKHS exponential family models. The algorithm works by solving a convex quadratic problem to fit the gradient of the log density, the score, thus avoiding the need to estimate the normalizing constant. We show the resulting estimator to be consistent and provide convergence rates when the model is well-specified.

    Structuring and regularizing implicit generative models. In a first contribution, we introduce a method for learning Generative Adversarial Networks, a class of implicit generative models, using a parametric family of Maximum Mean Discrepancies (MMD). We show that controlling the gradient of the critic function defining the MMD is vital for obtaining a sensible loss function, and we devise a method to enforce exact, analytical gradient constraints. As a second contribution, we introduce and study a new generative model suited for data with low intrinsic dimension embedded in a high-dimensional space. This model combines two components: an implicit model, which can learn the low-dimensional support of the data, and an energy function, which refines the probability mass by importance sampling on the support of the implicit model. We further introduce algorithms for learning such a hybrid model and for efficient sampling.

    Optimizing implicit generative models. We first study the Wasserstein gradient flow of the Maximum Mean Discrepancy in a non-parametric setting and provide smoothness conditions on the trajectory of the flow that ensure global convergence. We identify cases where these conditions do not hold and propose a new algorithm based on noise injection to mitigate the problem. In a second contribution, we consider the Wasserstein gradient flow of generic loss functionals in a parametric setting. This flow is invariant to the model's parameterization, just like the Fisher gradient flow in information geometry, and has the additional benefit of being well defined even for models with varying supports, which makes it particularly well suited for implicit generative models. We then introduce a general framework for approximating the Wasserstein natural gradient by leveraging a dual formulation of the Wasserstein pseudo-Riemannian metric that we restrict to a reproducing kernel Hilbert space. The resulting estimator is scalable and provably consistent, as it relies on Nyström methods.
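    The quantity that recurs throughout this abstract is the Maximum Mean Discrepancy. As a minimal, self-contained reference point, the sketch below computes a plain Gaussian-kernel V-statistic estimate of the squared MMD between two samples; this is only the basic quantity, not the parametric MMD families or learned critics described above.

```python
import numpy as np

# Biased (V-statistic) estimator of the squared Maximum Mean Discrepancy
# between samples X ~ P and Y ~ Q, using a Gaussian kernel.
def gaussian_kernel(A, B, bandwidth=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    kxx = gaussian_kernel(X, X, bandwidth).mean()
    kyy = gaussian_kernel(Y, Y, bandwidth).mean()
    kxy = gaussian_kernel(X, Y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # sample from a shifted Q
print("MMD^2 estimate:", mmd2(X, Y))
```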

    Design and Analysis of Statistical Learning Algorithms which Control False Discoveries

    In this thesis, general theoretical tools are constructed which can be applied to develop machine learning algorithms that are consistent, converge quickly, and minimize the generalization error by asymptotically controlling the false discovery rate (FDR) of features, especially for high-dimensional datasets. Even though the main inspiration of this work comes from biological applications, where the data are extremely high dimensional and often hard to obtain, the developed methods are applicable to any general statistical learning problem.

    In this work, various machine learning tasks, such as hypothesis testing, classification, and regression, are formulated as risk minimization problems. This allows such learning tasks to be viewed as optimization problems, which can be solved using first-order optimization techniques in large-data scenarios, while faster-converging second-order techniques can be used for small to moderately sized data sets. Further, such a formulation allows us to estimate the first-order convergence rates of an empirical risk estimator for any learning problem, using techniques from large deviation theory.

    In many scientific applications, the robust discovery of factors affecting an outcome or a phenotype is more important than the accuracy of predictions. Hence, it is essential to find an appropriate way to regularize an under-determined estimation problem and thereby control the generalization error. In this work, the local probability of false discovery is explored as such a regularization parameter, forcing the optimized solution towards functions with a lower probability of being a false discovery. Again, techniques from large deviation theory and the Gibbs principle allow the derivation of an appropriately regularized cost function.

    These two theoretical results are then used to develop concrete applications. First, the problem of multi-class classification is analyzed, in which a sample from an arbitrary probability measure is assigned to one of a finite number of categories based on a given training data set. A general risk functional is derived which can be used to learn Bayes-optimal classifiers that control the false discovery rate. Secondly, the problem of model selection in the regression context is considered, aiming to select a subset of given regressors which explains most of the observed variation, i.e., to perform ANOVA. Again, using the techniques mentioned above, a risk function is derived which, when optimized, controls the rate of false discoveries. This technique is shown to outperform the popular LASSO algorithm, which can be proven to control only the FWER, not the FDR. Finally, the problem of inferring under-sampled and partially observed non-negative discrete random variables is addressed, which has applications to the analysis of RNA sequencing data. By assuming infinite divisibility of the underlying random variable, its characterization as a discrete Compound Poisson measure (DCP) is derived. This allows the construction of a non-parametric Bayesian model of DCPs with a Pitman-Yor mixture process prior, which is shown to allow consistent inference under Kullback-Leibler and Rényi divergences even in the under-sampled regime.
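    The unifying formulation above is empirical risk minimization solved by a first-order method. The sketch below is only a generic instance of that formulation (logistic loss with a simple ridge penalty, minimized by gradient descent in NumPy); the FDR-controlling regularizers derived in the thesis are not reproduced here.

```python
import numpy as np

# Generic empirical risk minimization: binary classification with logistic
# loss plus a ridge penalty, solved by plain gradient descent.
rng = np.random.default_rng(0)

n, p = 500, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 2.0                                   # only 3 truly relevant features
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

def empirical_risk_grad(beta, lam=0.1):
    """Gradient of the penalized logistic empirical risk at beta."""
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (prob - y) / n + lam * beta

beta = np.zeros(p)
for _ in range(2000):                                 # first-order optimization
    beta -= 0.5 * empirical_risk_grad(beta)

print("largest |coefficients|:", np.round(np.sort(np.abs(beta))[::-1][:5], 2))
```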

    Distribution-Dissimilarities in Machine Learning

    Any binary classifier (or score function) can be used to define a dissimilarity between two distributions. Many well-known distribution-dissimilarities are in fact classifier-based: total variation, the KL and JS divergences, the Hellinger distance, and so on. Moreover, many recent popular generative modeling algorithms compute or approximate these distribution-dissimilarities by explicitly training a classifier, e.g., generative adversarial networks (GANs) and their variants. This thesis introduces and studies such classifier-based distribution-dissimilarities. After a general introduction, the first part analyzes the influence of the classifiers' capacity on the dissimilarity's strength for the special case of maximum mean discrepancies (MMD) and provides applications. The second part studies applications of classifier-based distribution-dissimilarities in the context of generative modeling and presents two new algorithms: Wasserstein Auto-Encoders (WAE) and AdaGAN. The third and final part focuses on adversarial examples, i.e., targeted but imperceptible input perturbations that lead to drastically different predictions of an artificial classifier. It shows that the adversarial vulnerability of neural-network-based classifiers typically increases with the input dimension, independently of the network topology.
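    As a minimal illustration of the classifier-based view (a toy linear logistic classifier, not one of the thesis' constructions), the sketch below trains a classifier to separate two samples and turns its cross-entropy into a plug-in estimate of the Jensen-Shannon divergence, using the identity JSD ≈ log 2 − cross-entropy for balanced samples.

```python
import numpy as np

# Classifier-based distribution-dissimilarity: train a logistic classifier to
# distinguish samples of P (label 1) from samples of Q (label 0); its
# cross-entropy yields a plug-in lower bound on the JS divergence.
rng = np.random.default_rng(0)

P = rng.normal(0.0, 1.0, size=(1000, 2))
Q = rng.normal(1.0, 1.0, size=(1000, 2))
X = np.vstack([P, Q])
y = np.concatenate([np.ones(len(P)), np.zeros(len(Q))])

w, b = np.zeros(2), 0.0
for _ in range(3000):                          # gradient descent on the cross-entropy
    s = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = s - y
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

s = np.clip(1.0 / (1.0 + np.exp(-(X @ w + b))), 1e-6, 1 - 1e-6)
bce = -np.mean(y * np.log(s) + (1 - y) * np.log(1 - s))
print("estimated JS divergence (nats):", np.log(2.0) - bce)
```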

    Graph Priors, Optimal Transport, and Deep Learning in Biomedical Discovery

    Recent advances in biomedical data collection allow the assembly of massive datasets measuring thousands of features in thousands to millions of individual cells. These data have the potential to advance our understanding of biological mechanisms at a previously impossible resolution; however, there are few methods for understanding data of this scale and type. While neural networks have made tremendous progress on supervised learning problems, much work remains to make them useful for discovery in data where the supervision is harder to represent. The flexibility and expressiveness of neural networks can be a hindrance in these less supervised domains, as is the case when extracting knowledge from biomedical data. One type of prior knowledge that is common in biological data comes in the form of geometric constraints. In this thesis, we aim to leverage this geometric knowledge to create scalable and interpretable models for understanding such data. Encoding geometric priors into neural network and graph models allows us to characterize the models' solutions as they relate to the fields of graph signal processing and optimal transport, and these links help us understand and interpret this type of data.

    We divide this work into three sections. The first borrows concepts from graph signal processing to construct more interpretable and performant neural networks by constraining and structuring the architecture. The second borrows from the theory of optimal transport to perform anomaly detection and trajectory inference efficiently and with theoretical guarantees. The third examines how to compare distributions over an underlying manifold, which can be used to understand how different perturbations or conditions relate; for this, we design an efficient approximation of optimal transport based on diffusion over a joint cell graph. Together, these works utilize our prior understanding of the data geometry to create more useful models of the data. We apply these methods to molecular graphs, images, single-cell sequencing, and health record data.
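    The idea of comparing two cell populations through diffusion on a joint graph can be sketched roughly as follows. This is a loose NumPy illustration under my own simplifying choices (a k-nearest-neighbour affinity, a random-walk operator, and an L1 comparison of diffused empirical distributions over several diffusion times), not the thesis' exact algorithm.

```python
import numpy as np

# Compare two cell populations by diffusing their empirical distributions
# over a joint k-nearest-neighbour graph and summing L1 differences across scales.
rng = np.random.default_rng(0)

A_cells = rng.normal(0.0, 1.0, size=(200, 5))     # population A
B_cells = rng.normal(0.3, 1.0, size=(200, 5))     # population B (slightly shifted)
X = np.vstack([A_cells, B_cells])
n, k = len(X), 10

# Symmetric kNN affinity matrix on the pooled points.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
nn = np.argsort(d2, axis=1)[:, 1:k + 1]           # skip self at distance 0
W = np.zeros((n, n))
W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
W = np.maximum(W, W.T)

P = W / W.sum(axis=1, keepdims=True)              # row-stochastic random-walk operator

mu = np.zeros(n); mu[:len(A_cells)] = 1.0 / len(A_cells)   # empirical dist. of A
nu = np.zeros(n); nu[len(A_cells):] = 1.0 / len(B_cells)   # empirical dist. of B

dist = 0.0
for t in range(5):                                # compare at several diffusion times
    mu, nu = mu @ P, nu @ P                       # one diffusion step each
    dist += np.abs(mu - nu).sum()
print("diffusion-based dissimilarity:", dist)
```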