A Scale Mixture Perspective of Multiplicative Noise in Neural Networks
Corrupting the input and hidden layers of deep neural networks (DNNs) with
multiplicative noise, often drawn from the Bernoulli distribution (or
'dropout'), provides regularization that has significantly contributed to deep
learning's success. However, understanding how multiplicative corruptions
prevent overfitting has been difficult due to the complexity of a DNN's
functional form. In this paper, we show that when a Gaussian prior is placed on
a DNN's weights, applying multiplicative noise induces a Gaussian scale
mixture, which can be reparameterized to circumvent the problematic likelihood
function. Analysis can then proceed by using a type-II maximum likelihood
procedure to derive a closed-form expression revealing how regularization
evolves as a function of the network's weights. Results show that
multiplicative noise forces weights to become either sparse or invariant to
rescaling. We find our analysis has implications for model compression as it
naturally reveals a weight pruning rule that starkly contrasts with the
commonly used signal-to-noise ratio (SNR). While the SNR prunes weights with
large variances, seeing them as noisy, our approach recognizes their robustness
and retains them. We empirically demonstrate our approach has a strong
advantage over the SNR heuristic and is competitive with retraining with soft
targets produced from a teacher model.
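To make the contrast concrete, the sketch below applies the standard SNR pruning heuristic that the abstract argues against. The posterior means, standard deviations, and the threshold of 1.0 are all invented for illustration; they are not numbers or formulas from the paper.

```python
import numpy as np

# Hypothetical posterior summaries for six weights (illustrative values only).
mu = np.array([2.0, 0.05, 1.0, 0.5, 3.0, 0.01])     # posterior means
sigma = np.array([1.5, 0.01, 0.1, 2.0, 0.2, 0.5])   # posterior std devs

# Common SNR heuristic: prune weights whose |mean| / std falls below a
# threshold, i.e. treat high-variance weights as "noisy".
snr = np.abs(mu) / sigma
prune_snr = snr < 1.0   # threshold chosen purely for illustration

print(prune_snr.tolist())
```

Under this heuristic, the weight with mean 0.5 and std 2.0 is pruned because of its large variance; the paper's analysis suggests such a weight may instead be robust to rescaling and worth retaining.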
Bayesian Compression for Deep Learning
Compression and computational efficiency in deep learning have become a
problem of great significance. In this work, we argue that the most principled
and effective way to attack this problem is by adopting a Bayesian point of
view, where through sparsity inducing priors we prune large parts of the
network. We introduce two novelties in this paper: 1) we use hierarchical
priors to prune nodes instead of individual weights, and 2) we use the
posterior uncertainties to determine the optimal fixed point precision to
encode the weights. Both factors significantly contribute to achieving the
state of the art in terms of compression rates, while still staying competitive
with methods designed to optimize for speed or energy efficiency.Comment: Published as a conference paper at NIPS 201
Meta-analysis of functional neuroimaging data using Bayesian nonparametric binary regression
In this work we perform a meta-analysis of neuroimaging data, consisting of
locations of peak activations identified in 162 separate studies on emotion.
Neuroimaging meta-analyses are typically performed using kernel-based methods.
However, these methods require the width of the kernel to be set a priori and
to be constant across the brain. To address these issues, we propose a fully
Bayesian nonparametric binary regression method to perform neuroimaging
meta-analyses. In our method, each location (or voxel) has a probability of
being a peak activation, and the corresponding probability function is based on
a spatially adaptive Gaussian Markov random field (GMRF). We also include
parameters in the model to robustify the procedure against miscoding of the
voxel response. Posterior inference is implemented using efficient MCMC
algorithms extended from those introduced in Holmes and Held [Bayesian Anal. 1
(2006) 145--168]. Our method allows the probability function to be locally
adaptive with respect to the covariates, that is, to be smooth in one region of
the covariate space and wiggly or even discontinuous in another. Posterior
miscoding probabilities for each of the identified voxels can also be obtained,
identifying voxels that may have been falsely classified as being activated.
Simulation studies and application to the emotion neuroimaging data indicate
that our method is superior to standard kernel-based methods.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS523 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
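The kernel-based baseline that the abstract criticizes can be sketched in a few lines: place a Gaussian kernel of fixed, a priori width at every reported peak and sum them into an activation map. The 1-D grid, peak locations, and kernel width below are toy assumptions; the Bayesian method's point is precisely that this width should not be fixed across the brain.

```python
import numpy as np

# Toy 1-D "brain": peak activation locations pooled across studies.
peaks = np.array([12.0, 14.0, 15.0, 40.0])
grid = np.linspace(0, 50, 501)

def kernel_map(peaks, grid, width=2.0):
    # Fixed-width Gaussian kernel at each peak, summed over peaks.
    # The constant `width` is exactly what kernel methods must set a priori.
    d = grid[:, None] - peaks[None, :]
    return np.exp(-0.5 * (d / width) ** 2).sum(axis=1)

m = kernel_map(peaks, grid)
```

The cluster of peaks near 14 produces a much higher map value than the isolated peak at 40, but a single isolated peak and a tight cluster are smoothed identically regardless of the local data density.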
Minimal penalties and the slope heuristics: a survey
Birgé and Massart proposed in 2001 the slope heuristics as a way to
choose optimally from data an unknown multiplicative constant in front of a
penalty. It is built upon the notion of minimal penalty, and it has been
generalized since to some "minimal-penalty algorithms". This paper reviews the
theoretical results obtained for such algorithms, with a self-contained proof
in the simplest framework, precise proof ideas for further generalizations, and
a few new results. Explicit connections are made with residual-variance
estimators (with an original contribution on this topic, showing that for this
task the slope heuristics performs almost as well as a residual-based estimator
with the best model choice) and with some classical algorithms such as the
L-curve or elbow heuristics, Mallows' Cp, and Akaike's FPE. Practical issues
are also addressed, including two new practical definitions of minimal-penalty
algorithms that are compared on synthetic data to previously proposed
definitions. Finally, several conjectures and open problems are suggested as
future research directions.
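A minimal sketch of the dimension-jump form of the slope heuristics, on a synthetic model collection: scan the penalty constant kappa, watch the selected model dimension, take the kappa at the abrupt drop as the minimal penalty, and use twice that as the final penalty. The toy risk curve (variance term of slope 1/n, so kappa_min should be about 1/n) is an assumption for illustration.

```python
import numpy as np

# Synthetic model collection: dimensions 1..50, toy empirical risk with a
# bias term vanishing at true_d and a variance term of slope 1/n.
n = 1000
D = np.arange(1, 51)
true_d = 10
emp_risk = np.where(D < true_d, 0.05 * (true_d - D), 0.0) - D / n

def selected_dim(kappa):
    # Model selected by the penalized criterion emp_risk + kappa * D.
    return D[np.argmin(emp_risk + kappa * D)]

# Scan kappa and locate the "dimension jump": the selected dimension drops
# abruptly when kappa crosses the minimal penalty (~ 1/n here).
kappas = np.linspace(1e-4, 5e-3, 200)
dims = np.array([selected_dim(k) for k in kappas])
jump = np.argmax(dims[:-1] - dims[1:])
kappa_min = kappas[jump + 1]

# Slope heuristics: use twice the minimal penalty.
kappa_opt = 2 * kappa_min
final_dim = selected_dim(kappa_opt)
```

With the doubled penalty the procedure recovers the true dimension 10, whereas any kappa below the minimal penalty selects the largest model in the collection.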
A Unified Framework for Sparse Non-Negative Least Squares using Multiplicative Updates and the Non-Negative Matrix Factorization Problem
We study the sparse non-negative least squares (S-NNLS) problem. S-NNLS
occurs naturally in a wide variety of applications where an unknown,
non-negative quantity must be recovered from linear measurements. We present a
unified framework for S-NNLS based on a rectified power exponential scale
mixture prior on the sparse codes. We show that the proposed framework
encompasses a large class of S-NNLS algorithms and provide a computationally
efficient inference procedure based on multiplicative update rules. Such update
rules are convenient for solving large sets of S-NNLS problems simultaneously,
which is required in contexts like sparse non-negative matrix factorization
(S-NMF). We provide theoretical justification for the proposed approach by
showing that the local minima of the objective function being optimized are
sparse and the S-NNLS algorithms presented are guaranteed to converge to a set
of stationary points of the objective function. We then extend our framework to
S-NMF, showing that our framework leads to many well known S-NMF algorithms
under specific choices of prior and providing a guarantee that a popular
subclass of the proposed algorithms converges to a set of stationary points of
the objective function. Finally, we study the performance of the proposed
approaches on synthetic and real-world data.
Comment: To appear in Signal Processing.
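The flavor of multiplicative update rules the framework builds on can be sketched with the classic Lee-Seung-style update for plain NNLS, min ||y - Ah||^2 with h >= 0: each coordinate is rescaled by a ratio of non-negative terms, so non-negativity is preserved automatically. The dictionary, sparse code, and iteration count below are invented; the paper's algorithms generalize this via scale-mixture priors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy S-NNLS instance: non-negative dictionary A, sparse non-negative code.
A = rng.random((30, 10))
h_true = np.zeros(10)
h_true[[2, 7]] = [1.0, 2.0]
y = A @ h_true

def nnls_multiplicative(A, y, n_iter=2000, eps=1e-12):
    # Multiplicative update h <- h * (A^T y) / (A^T A h): all factors are
    # non-negative, so iterates stay in the non-negative orthant.
    h = np.ones(A.shape[1])
    AtY = A.T @ y
    AtA = A.T @ A
    for _ in range(n_iter):
        h *= AtY / (AtA @ h + eps)
    return h

h = nnls_multiplicative(A, y)
```

Because the update is a per-coordinate rescaling, a whole batch of S-NNLS problems (as in S-NMF) can be run in parallel as one matrix operation, which is the convenience the abstract points to.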