3 research outputs found

    Autoencoders Learn Generative Linear Models

    We provide a series of results for unsupervised learning with autoencoders. Specifically, we study shallow two-layer autoencoder architectures with shared weights. We focus on three generative models for data that are common in statistical machine learning: (i) the mixture-of-Gaussians model, (ii) the sparse coding model, and (iii) the sparsity model with non-negative coefficients. For each of these models, we prove that, under suitable choices of hyperparameters, architectures, and initialization, autoencoders learned by gradient descent can successfully recover the parameters of the corresponding model. To our knowledge, this is the first result that rigorously studies the dynamics of gradient descent for weight-sharing autoencoders. Our analysis can be viewed as theoretical evidence that shallow autoencoder modules can indeed be used as feature learning mechanisms for a variety of data models, and it may shed light on how to train larger stacked architectures with autoencoders as basic building blocks.
    Comment: Experimental study on synthetic data added. Typos fixed.
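
    As a concrete illustration of this setup, the sketch below implements a shallow tied-weight autoencoder, reconstructing x as W^T relu(Wx + b), trained by plain gradient descent on synthetic mixture-of-Gaussians data in NumPy. The dimensions, learning rate, initialization, and ReLU activation are assumptions made for illustration; they are not the exact architecture or hyperparameters analyzed in the paper.

```python
# Minimal NumPy sketch: a shallow autoencoder with shared (tied) weights,
# x_hat = W^T relu(W x + b), trained by plain gradient descent on
# mixture-of-Gaussians data. All sizes and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic mixture of K Gaussians with unit-norm means in R^d.
d, K, n = 32, 4, 2048
means = rng.normal(size=(K, d))
means /= np.linalg.norm(means, axis=1, keepdims=True)
X = means[rng.integers(K, size=n)] + 0.05 * rng.normal(size=(n, d))

# Tied-weight autoencoder: encoder and decoder reuse the same matrix W.
h_dim, lr = K, 0.1
W = 0.1 * rng.normal(size=(h_dim, d))
b = np.zeros(h_dim)

def mse(W, b):
    H = np.maximum(X @ W.T + b, 0.0)
    return float(((H @ W - X) ** 2).mean())

print("initial MSE:", mse(W, b))
for _ in range(2000):
    H = np.maximum(X @ W.T + b, 0.0)       # hidden codes, shape (n, h_dim)
    R = H @ W - X                          # reconstruction residual, shape (n, d)
    G = (H > 0) * (R @ W.T)                # gradient at the pre-activation
    W -= lr * (G.T @ X + H.T @ R) / n      # W gets encoder + decoder contributions
    b -= lr * G.mean(axis=0)
print("final MSE:", mse(W, b))
```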

    Walking the Tightrope: An Investigation of the Convolutional Autoencoder Bottleneck

    In this paper, we present an in-depth investigation of the convolutional autoencoder (CAE) bottleneck. Autoencoders (AEs), and especially their convolutional variants, play a vital role in the current deep learning toolbox. Researchers and practitioners employ CAEs for a variety of tasks, ranging from outlier detection and compression to transfer and representation learning. Despite their widespread adoption, we have limited insight into how the bottleneck shape impacts the emergent properties of the CAE. We demonstrate that increasing the height and width of the bottleneck drastically improves generalization, which in turn leads to better performance of the latent codes in downstream transfer learning tasks. The number of channels in the bottleneck, on the other hand, is of secondary importance. Furthermore, we show empirically that, contrary to popular belief, CAEs do not learn to copy their input, even when the bottleneck has the same number of neurons as there are pixels in the input. Copying does not occur even after training the CAE for 1,000 epochs on a tiny dataset (≈ 600 images). We believe that the findings in this paper are directly applicable and will lead to improvements in models that rely on CAEs.
    Comment: code available at https://github.com/IljaManakov/WalkingTheTightrop
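
    The bottleneck comparison described above is easy to parameterize. Below is a small PyTorch sketch of a convolutional autoencoder in which the bottleneck's spatial size (via the number of stride-2 downsampling steps) and its channel count can be varied independently, which is the distinction the paper investigates. The layer widths, kernel sizes, and input resolution are assumed for illustration; this is a toy architecture, not the one used in the study.

```python
# Toy convolutional autoencoder with an explicitly shaped bottleneck.
import torch
import torch.nn as nn

class SmallCAE(nn.Module):
    def __init__(self, bottleneck_channels: int = 8, downsamples: int = 2):
        super().__init__()
        # Each stride-2 convolution halves height and width, so `downsamples`
        # sets the bottleneck's spatial size and `bottleneck_channels` its depth.
        enc, ch = [], 1
        for i in range(downsamples):
            out = bottleneck_channels if i == downsamples - 1 else 16
            enc += [nn.Conv2d(ch, out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            ch = out
        self.encoder = nn.Sequential(*enc)

        dec = []
        for i in range(downsamples):
            out = 1 if i == downsamples - 1 else 16
            dec.append(nn.ConvTranspose2d(ch, out, kernel_size=4, stride=2, padding=1))
            if i != downsamples - 1:
                dec.append(nn.ReLU())
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.encoder(x)                  # bottleneck activations
        return self.decoder(z), z

model = SmallCAE(bottleneck_channels=8, downsamples=2)
x = torch.randn(4, 1, 32, 32)                # a batch of 32x32 grayscale images
x_hat, z = model(x)
print(z.shape)       # torch.Size([4, 8, 8, 8]): channels x height x width of the bottleneck
print(x_hat.shape)   # torch.Size([4, 1, 32, 32])
```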

    Learning Distributions Generated by One-Layer ReLU Networks

    We consider the problem of estimating the parameters of a $d$-dimensional rectified Gaussian distribution from i.i.d. samples. A rectified Gaussian distribution is defined by passing a standard Gaussian distribution through a one-layer ReLU neural network. We give a simple algorithm to estimate the parameters (i.e., the weight matrix and bias vector of the ReLU neural network) up to an error $\epsilon \|W\|_F$ using $\tilde{O}(1/\epsilon^2)$ samples and $\tilde{O}(d^2/\epsilon^2)$ time (log factors are ignored for simplicity). This implies that we can estimate the distribution up to $\epsilon$ in total variation distance using $\tilde{O}(\kappa^2 d^2/\epsilon^2)$ samples, where $\kappa$ is the condition number of the covariance matrix. Our only assumption is that the bias vector is non-negative. Without this non-negativity assumption, we show that estimating the bias vector within any error requires a number of samples at least exponential in the infinity norm of the bias vector. Our algorithm is based on the key observation that vector norms and pairwise angles can be estimated separately. We use a recent result on learning from truncated samples. We also prove two sample complexity lower bounds: $\Omega(1/\epsilon^2)$ samples are required to estimate the parameters up to error $\epsilon$, while $\Omega(d/\epsilon^2)$ samples are necessary to estimate the distribution up to $\epsilon$ in total variation distance. The first lower bound implies that our algorithm is optimal for parameter estimation. Finally, we show an interesting connection between learning a two-layer generative model and non-negative matrix factorization. Experimental results are provided to support our analysis.
    Comment: NeurIPS 2019
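
    For concreteness, the generative model in this abstract can be written out directly: a sample is x = ReLU(Wz + b) with z drawn from a standard Gaussian. The sketch below (with arbitrary, assumed dimensions) only draws samples from this model and checks one simple recoverable statistic, the per-coordinate probability of an exact zero, which equals $\Phi(-b_i/\|w_i\|)$; it is not the paper's estimation algorithm.

```python
# Sketch of the rectified Gaussian model: x = ReLU(W z + b), z ~ N(0, I).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

d, n = 5, 200_000
W = rng.normal(size=(d, d))                 # assumed ground-truth weight matrix
b = np.abs(rng.normal(size=d))              # non-negative bias, the paper's key assumption

Z = rng.normal(size=(n, d))                 # latent standard Gaussian samples
X = np.maximum(Z @ W.T + b, 0.0)            # i.i.d. rectified Gaussian samples

# Coordinate i is ReLU(w_i . z + b_i) with w_i . z ~ N(0, ||w_i||^2), so an exact
# zero occurs with probability Phi(-b_i / ||w_i||). Matching the empirical zero
# fraction against this value shows that b_i / ||w_i|| is recoverable from samples.
row_norms = np.linalg.norm(W, axis=1)
print(np.round((X == 0).mean(axis=0), 3))     # empirical zero fractions
print(np.round(norm.cdf(-b / row_norms), 3))  # model-implied zero probabilities
```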