3 research outputs found

    Autoencoders Learn Generative Linear Models

    We provide a series of results for unsupervised learning with autoencoders. Specifically, we study shallow two-layer autoencoder architectures with shared weights. We focus on three generative models for data that are common in statistical machine learning: (i) the mixture-of-Gaussians model, (ii) the sparse coding model, and (iii) the sparsity model with non-negative coefficients. For each of these models, we prove that, under suitable choices of hyperparameters, architectures, and initialization, autoencoders learned by gradient descent can successfully recover the parameters of the corresponding model. To our knowledge, this is the first result that rigorously studies the dynamics of gradient descent for weight-sharing autoencoders. Our analysis can be viewed as theoretical evidence that shallow autoencoder modules can indeed be used as feature learning mechanisms for a variety of data models, and it may shed light on how to train larger stacked architectures with autoencoders as basic building blocks.
    Comment: Experimental study on synthetic data added. Typos fixed.
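
    As a concrete illustration of this setup, the sketch below implements a shallow tied-weight autoencoder, reconstructing x as W^T relu(Wx + b), trained by plain gradient descent on synthetic mixture-of-Gaussians data in NumPy. The dimensions, learning rate, initialization, and ReLU activation are assumptions made for illustration; they are not the exact architecture or hyperparameters analyzed in the paper.

```python
# Minimal NumPy sketch: a shallow autoencoder with shared (tied) weights,
# x_hat = W^T relu(W x + b), trained by plain gradient descent on
# mixture-of-Gaussians data. All sizes and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic mixture of K Gaussians with unit-norm means in R^d.
d, K, n = 32, 4, 2048
means = rng.normal(size=(K, d))
means /= np.linalg.norm(means, axis=1, keepdims=True)
X = means[rng.integers(K, size=n)] + 0.05 * rng.normal(size=(n, d))

# Tied-weight autoencoder: encoder and decoder reuse the same matrix W.
h_dim, lr = K, 0.1
W = 0.1 * rng.normal(size=(h_dim, d))
b = np.zeros(h_dim)

def mse(W, b):
    H = np.maximum(X @ W.T + b, 0.0)
    return float(((H @ W - X) ** 2).mean())

print("initial MSE:", mse(W, b))
for _ in range(2000):
    H = np.maximum(X @ W.T + b, 0.0)       # hidden codes, shape (n, h_dim)
    R = H @ W - X                          # reconstruction residual, shape (n, d)
    G = (H > 0) * (R @ W.T)                # gradient at the pre-activation
    W -= lr * (G.T @ X + H.T @ R) / n      # W gets encoder + decoder contributions
    b -= lr * G.mean(axis=0)
print("final MSE:", mse(W, b))
```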

    Walking the Tightrope: An Investigation of the Convolutional Autoencoder Bottleneck

    In this paper, we present an in-depth investigation of the convolutional autoencoder (CAE) bottleneck. Autoencoders (AEs), and especially their convolutional variants, play a vital role in the current deep learning toolbox. Researchers and practitioners employ CAEs for a variety of tasks, ranging from outlier detection and compression to transfer and representation learning. Despite their widespread adoption, we have limited insight into how the bottleneck shape impacts the emergent properties of the CAE. We demonstrate that increasing the height and width of the bottleneck drastically improves generalization, which in turn leads to better performance of the latent codes in downstream transfer learning tasks. The number of channels in the bottleneck, on the other hand, is of secondary importance. Furthermore, we show empirically that, contrary to popular belief, CAEs do not learn to copy their input, even when the bottleneck has the same number of neurons as there are pixels in the input. Copying does not occur even after training the CAE for 1,000 epochs on a tiny dataset (≈ 600 images). We believe that the findings in this paper are directly applicable and will lead to improvements in models that rely on CAEs.
    Comment: code available at https://github.com/IljaManakov/WalkingTheTightrop
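
    The bottleneck comparison described above is easy to parameterize. Below is a small PyTorch sketch of a convolutional autoencoder in which the bottleneck's spatial size (via the number of stride-2 downsampling steps) and its channel count can be varied independently, which is the distinction the paper investigates. The layer widths, kernel sizes, and input resolution are assumed for illustration; this is a toy architecture, not the one used in the study.

```python
# Toy convolutional autoencoder with an explicitly shaped bottleneck.
import torch
import torch.nn as nn

class SmallCAE(nn.Module):
    def __init__(self, bottleneck_channels: int = 8, downsamples: int = 2):
        super().__init__()
        # Each stride-2 convolution halves height and width, so `downsamples`
        # sets the bottleneck's spatial size and `bottleneck_channels` its depth.
        enc, ch = [], 1
        for i in range(downsamples):
            out = bottleneck_channels if i == downsamples - 1 else 16
            enc += [nn.Conv2d(ch, out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            ch = out
        self.encoder = nn.Sequential(*enc)

        dec = []
        for i in range(downsamples):
            out = 1 if i == downsamples - 1 else 16
            dec.append(nn.ConvTranspose2d(ch, out, kernel_size=4, stride=2, padding=1))
            if i != downsamples - 1:
                dec.append(nn.ReLU())
            ch = out
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.encoder(x)                  # bottleneck activations
        return self.decoder(z), z

model = SmallCAE(bottleneck_channels=8, downsamples=2)
x = torch.randn(4, 1, 32, 32)                # a batch of 32x32 grayscale images
x_hat, z = model(x)
print(z.shape)       # torch.Size([4, 8, 8, 8]): channels x height x width of the bottleneck
print(x_hat.shape)   # torch.Size([4, 1, 32, 32])
```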

    Learning Distributions Generated by One-Layer ReLU Networks

    We consider the problem of estimating the parameters of a $d$-dimensional rectified Gaussian distribution from i.i.d. samples. A rectified Gaussian distribution is defined by passing a standard Gaussian distribution through a one-layer ReLU neural network. We give a simple algorithm to estimate the parameters (i.e., the weight matrix and bias vector of the ReLU neural network) up to an error $\epsilon \|W\|_F$ using $\tilde{O}(1/\epsilon^2)$ samples and $\tilde{O}(d^2/\epsilon^2)$ time (log factors are ignored for simplicity). This implies that we can estimate the distribution up to $\epsilon$ in total variation distance using $\tilde{O}(\kappa^2 d^2/\epsilon^2)$ samples, where $\kappa$ is the condition number of the covariance matrix. Our only assumption is that the bias vector is non-negative. Without this non-negativity assumption, we show that estimating the bias vector within any error requires a number of samples at least exponential in the infinity norm of the bias vector. Our algorithm is based on the key observation that vector norms and pairwise angles can be estimated separately. We use a recent result on learning from truncated samples. We also prove two sample complexity lower bounds: $\Omega(1/\epsilon^2)$ samples are required to estimate the parameters up to error $\epsilon$, while $\Omega(d/\epsilon^2)$ samples are necessary to estimate the distribution up to $\epsilon$ in total variation distance. The first lower bound implies that our algorithm is optimal for parameter estimation. Finally, we show an interesting connection between learning a two-layer generative model and non-negative matrix factorization. Experimental results are provided to support our analysis.
    Comment: NeurIPS 2019
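
    For concreteness, the generative model in this abstract can be written out directly: a sample is x = ReLU(Wz + b) with z drawn from a standard Gaussian. The sketch below (with arbitrary, assumed dimensions) only draws samples from this model and checks one simple recoverable statistic, the per-coordinate probability of an exact zero, which equals $\Phi(-b_i/\|w_i\|)$; it is not the paper's estimation algorithm.

```python
# Sketch of the rectified Gaussian model: x = ReLU(W z + b), z ~ N(0, I).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

d, n = 5, 200_000
W = rng.normal(size=(d, d))                 # assumed ground-truth weight matrix
b = np.abs(rng.normal(size=d))              # non-negative bias, the paper's key assumption

Z = rng.normal(size=(n, d))                 # latent standard Gaussian samples
X = np.maximum(Z @ W.T + b, 0.0)            # i.i.d. rectified Gaussian samples

# Coordinate i is ReLU(w_i . z + b_i) with w_i . z ~ N(0, ||w_i||^2), so an exact
# zero occurs with probability Phi(-b_i / ||w_i||). Matching the empirical zero
# fraction against this value shows that b_i / ||w_i|| is recoverable from samples.
row_norms = np.linalg.norm(W, axis=1)
print(np.round((X == 0).mean(axis=0), 3))     # empirical zero fractions
print(np.round(norm.cdf(-b / row_norms), 3))  # model-implied zero probabilities
```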