6 research outputs found
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee
Sparse deep learning aims to address the challenge of huge storage
consumption by deep neural networks, and to recover the sparse structure of
target functions. Although tremendous empirical successes have been achieved,
most sparse deep learning algorithms lack theoretical support. On the
other hand, another line of work has proposed theoretical frameworks that are
computationally infeasible. In this paper, we train sparse deep neural networks
with a fully Bayesian treatment under spike-and-slab priors, and develop a set
of computationally efficient variational inference algorithms via a continuous
relaxation of the Bernoulli distribution. The variational posterior contraction rate is
provided, which justifies the consistency of the proposed variational Bayes
method. Notably, our empirical results demonstrate that this variational
procedure provides uncertainty quantification in terms of the Bayesian predictive
distribution and is also capable of accomplishing consistent variable selection by
training a sparse multi-layer neural network. Comment: Accepted to NeurIPS 2020.
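As a concrete illustration of the continuous relaxation mentioned above, here is a minimal PyTorch sketch of a linear layer whose weights are gated by binary Concrete (Gumbel-Softmax) variables, a standard relaxation of Bernoulli inclusion indicators; the class name, prior inclusion probability, and temperature are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcreteSpikeSlabLinear(nn.Module):
    """Linear layer with spike-and-slab-style gates relaxed via the binary Concrete trick."""

    def __init__(self, d_in, d_out, temperature=0.5):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(d_out, d_in))  # slab (weight) means
        self.logit_pi = nn.Parameter(torch.zeros(d_out, d_in))      # logits of q(z_ij = 1)
        self.temperature = temperature

    def sample_gates(self):
        # Binary Concrete / Gumbel-Softmax relaxation of the Bernoulli inclusion variables
        u = torch.rand_like(self.logit_pi).clamp(1e-6, 1 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        return torch.sigmoid((self.logit_pi + logistic_noise) / self.temperature)

    def forward(self, x):
        return F.linear(x, self.weight * self.sample_gates())

    def kl_gates(self, prior_pi=0.1):
        # KL(q(z) || Bernoulli(prior_pi)): the sparsity-inducing part of the ELBO penalty
        q = torch.sigmoid(self.logit_pi)
        return (q * torch.log(q / prior_pi)
                + (1 - q) * torch.log((1 - q) / (1 - prior_pi))).sum()

layer = ConcreteSpikeSlabLinear(8, 4)
x = torch.randn(32, 8)
loss = F.mse_loss(layer(x), torch.zeros(32, 4)) + layer.kl_gates() / 32
loss.backward()

In a full variational treatment the negative ELBO would also include a KL term for the slab weights, omitted here for brevity.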
Understanding the wiring evolution in differentiable neural architecture search
Controversy exists on whether differentiable neural architecture search
methods discover wiring topology effectively. To understand how wiring topology
evolves, we study the underlying mechanism of several existing differentiable
NAS frameworks. Our investigation is motivated by three observed searching
patterns of differentiable NAS: 1) they search by growing instead of pruning;
2) wider networks are preferred over deeper ones; 3) no edges are selected
in bi-level optimization. To anatomize these phenomena, we propose a unified
view of the searching algorithms of existing frameworks, recasting the global
optimization as local cost minimization. Based on this reformulation, we
conduct empirical and theoretical analyses, revealing implicit inductive biases
in the cost's assignment mechanism and evolution dynamics that cause the
observed phenomena. These biases indicate strong discrimination towards certain
topologies. In light of these findings, we pose questions that future differentiable
methods for neural wiring discovery need to confront, hoping to provoke a discussion
and a rethinking of how much bias has been implicitly enforced in existing NAS
methods. Comment: AISTATS 2021.
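For readers unfamiliar with the frameworks under analysis, the following DARTS-style sketch shows one differentiable NAS edge, where architecture logits are trained jointly with the operation weights and the final wiring keeps the highest-scoring operation; the operation list and class name are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class MixedEdge(nn.Module):
    """One edge of a differentiable NAS cell: a softmax-weighted mixture of candidate ops."""

    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                  # skip connection
            nn.Conv2d(channels, channels, 3, padding=1),    # 3x3 conv candidate
            nn.Conv2d(channels, channels, 1),               # 1x1 conv candidate
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture logits

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedEdge(channels=8)
out = edge(torch.randn(2, 8, 16, 16))   # differentiable w.r.t. both ops and alpha

The searching patterns discussed above concern how the alpha values (and hence the selected wiring) evolve when such edges are optimized, in one-level or bi-level fashion, together with the operation weights.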
Adaptive Variational Bayesian Inference for Sparse Deep Neural Network
In this work, we focus on variational Bayesian inference on the sparse Deep
Neural Network (DNN) modeled under a class of spike-and-slab priors. Given a
pre-specified sparse DNN structure, the corresponding variational posterior
contraction rate is characterized, revealing a trade-off between the
variational error and the approximation error, which are both determined by the
network structural complexity (i.e., depth, width and sparsity). However, the
optimal network structure, which strikes the balance of the aforementioned
trade-off and yields the best rate, is generally unknown in reality. Therefore,
our work further develops an {\em adaptive} variational inference procedure
that can automatically select a reasonably good (data-dependent) network
structure that achieves the best contraction rate, without knowing the optimal
network structure. In particular, when the true function is Hölder smooth,
the adaptive variational inference can attain a (near-)optimal rate
without knowledge of the smoothness level. The above rate still suffers from
the curse of dimensionality, and thus motivates the teacher-student setup,
i.e., the true function is a sparse DNN model, under which the rate only
logarithmically depends on the input dimension.
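The sketch below gives one loose, hypothetical reading of the adaptive procedure: compare fitted variational bounds across candidate structures and keep the best one. The mock ELBO merely stands in for an actual variational fit, and nothing here is the paper's construction.

import itertools

def mock_elbo(depth, width, sparsity, n):
    # Stand-in for the ELBO of a fitted spike-and-slab variational posterior; a real
    # implementation would train the variational family for this structure on the data.
    # This toy score simply trades off structural complexity against sample size n.
    active_params = depth * width * width * sparsity
    return -(active_params / n + 1.0 / active_params)

def select_structure(n, depths=(2, 4, 8), widths=(16, 64, 256), sparsities=(0.01, 0.05, 0.1)):
    # Data-dependent structure selection: keep the candidate (depth, width, sparsity)
    # whose variational fit attains the largest evidence lower bound.
    candidates = itertools.product(depths, widths, sparsities)
    return max(candidates, key=lambda c: mock_elbo(*c, n=n))

print(select_structure(n=10_000))   # the (depth, width, sparsity) preferred at this sample size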
Luck Matters: Understanding Training Dynamics of Deep ReLU Networks
We analyze the dynamics of training deep ReLU networks and their implications
for generalization. Using a teacher-student setting, we discovered a
novel relationship between the gradient received by hidden student nodes and
the activations of teacher nodes for deep ReLU networks. With this relationship
and the assumption of small overlapping teacher node activations, we prove that
(1) student nodes whose weights are initialized to be close to teacher nodes
converge to them at a faster rate, and (2) in the over-parameterized two-layer
case, while a small set of lucky nodes do converge to the teacher
nodes, the fan-out weights of the other nodes converge to zero. This framework
provides insight into multiple puzzling phenomena in deep learning like
over-parameterization, implicit regularization, lottery tickets, etc. We verify
our assumption by showing that the majority of BatchNorm biases of pre-trained
VGG11/16 models are negative. Experiments on (1) random deep teacher networks
with Gaussian inputs, (2) teacher network pre-trained on CIFAR-10 and (3)
extensive ablation studies validate our theoretical predictions.
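The toy PyTorch experiment below is loosely in the spirit of this teacher-student setting (network sizes, learning rate, and the alignment metric are assumptions, not the paper's experiments): an over-parameterized student is trained to match a fixed two-layer ReLU teacher, and we then check which student nodes have aligned with teacher nodes.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_in, n_teacher, n_student, n_samples = 20, 5, 50, 1024

teacher = nn.Sequential(nn.Linear(d_in, n_teacher), nn.ReLU(), nn.Linear(n_teacher, 1))
student = nn.Sequential(nn.Linear(d_in, n_student), nn.ReLU(), nn.Linear(n_student, 1))
for p in teacher.parameters():
    p.requires_grad_(False)            # the teacher is fixed; it only provides labels

x = torch.randn(n_samples, d_in)
y = teacher(x)

opt = torch.optim.SGD(student.parameters(), lr=0.05)
for step in range(1000):
    opt.zero_grad()
    F.mse_loss(student(x), y).backward()
    opt.step()

# Cosine similarity between each student node's input weights and its best-matching
# teacher node; "lucky" nodes initialized near a teacher node should end up close to 1.
w_s = F.normalize(student[0].weight.detach(), dim=1)   # (n_student, d_in)
w_t = F.normalize(teacher[0].weight, dim=1)            # (n_teacher, d_in)
alignment = (w_s @ w_t.t()).max(dim=1).values
print("best-aligned student nodes:", alignment.topk(5).values)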
Understanding Self-supervised Learning with Dual Deep Networks
We propose a novel theoretical framework to understand contrastive
self-supervised learning (SSL) methods that employ dual pairs of deep ReLU
networks (e.g., SimCLR). First, we prove that in each SGD update of SimCLR with
various loss functions, including simple contrastive loss, soft Triplet loss
and InfoNCE loss, the weights at each layer are updated by a \emph{covariance
operator} that specifically amplifies initial random selectivities that vary
across data samples but survive averages over data augmentations. To further
study what role the covariance operator plays and which features are learned in
such a process, we model data generation and augmentation processes through a
\emph{hierarchical latent tree model} (HLTM) and prove that the hidden neurons
of deep ReLU networks can learn the latent variables in HLTM, despite the fact
that the network receives \emph{no direct supervision} from these unobserved
latent variables. This leads to a provable emergence of hierarchical features
through the amplification of initially random selectivities through contrastive
SSL. Extensive numerical studies justify our theoretical findings. Code is
released at https://github.com/facebookresearch/luckmatters/tree/master/ssl.
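For concreteness, here is a minimal SimCLR-style update with an NT-Xent (InfoNCE) loss; the encoder, the Gaussian-noise stand-in for data augmentation, and the hyperparameters are illustrative assumptions, not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    # NT-Xent: each sample's positive is its other augmented view; all other rows are negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                       # (2n, d) embeddings of both views
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])         # index of other view
    return F.cross_entropy(sim, targets)

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)

x = torch.randn(128, 32)                                 # toy batch
for _ in range(10):
    v1 = x + 0.1 * torch.randn_like(x)                   # two stochastic "augmentations"
    v2 = x + 0.1 * torch.randn_like(x)
    loss = info_nce(encoder(v1), encoder(v2))
    opt.zero_grad(); loss.backward(); opt.step()

The covariance-operator view described above concerns how this kind of per-layer SGD update behaves when averaged over augmentations and data samples.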
Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
Classifiers built with neural networks handle large-scale, high-dimensional
data, such as facial images from computer vision, extremely well, while
traditional statistical methods often fail miserably. In this paper, we attempt
to understand this empirical success in high dimensional classification by
deriving the convergence rates of excess risk. In particular, a teacher-student
framework is proposed that assumes the Bayes classifier to be expressed as ReLU
neural networks. In this setup, we obtain a sharp rate of convergence, i.e.,
$\tilde{O}_d(n^{-2/3})$, for classifiers trained using either the 0-1 loss or the hinge
loss. This rate can be further improved to $\tilde{O}_d(n^{-1})$ when the data
distribution is separable. Here, $n$ denotes the sample size. An interesting
observation is that the data dimension only contributes to the $\log n$ term
in the above rates. This may provide one theoretical explanation for the
empirical success of deep neural networks in high dimensional classification,
particularly for structured data.
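For reference, the excess risk bounded by the rates above is the standard quantity below; the notation is assumed here, not quoted from the paper.

% excess 0-1 risk of a learned classifier \hat{f} relative to the Bayes classifier f^*
\mathcal{E}(\hat{f}) \;=\; \mathbb{P}\bigl(Y \neq \operatorname{sign}\hat{f}(X)\bigr)
                     \;-\; \mathbb{P}\bigl(Y \neq \operatorname{sign} f^{*}(X)\bigr),

where $f^{*}$ is the Bayes classifier, assumed in this setting to be representable by a ReLU network.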