Understanding BatchNorm in Ternary Training
Neural networks comprise two components: weights and activation functions. Ternary weight neural networks (TNNs) achieve good performance and offer up to a 16x compression ratio. TNNs are difficult to train without BatchNorm, and there has been no study clarifying the role of BatchNorm in a ternary network. Building on a study of binary networks, we show how BatchNorm helps resolve the exploding-gradients issue.
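The 16x figure comes from replacing 32-bit floating-point weights with 2-bit ternary codes. A minimal NumPy sketch of threshold-based ternary quantization follows; the 0.7·mean(|w|) threshold and the scale alpha are a common heuristic assumed here, not necessarily the scheme analysed in the paper.

```python
import numpy as np

def ternarize(w, delta_scale=0.7):
    """Quantize a float weight tensor to {-alpha, 0, +alpha}.

    The threshold delta_scale * mean(|w|) and the per-tensor scale alpha
    follow a common ternarization heuristic (an assumption here, not
    necessarily the paper's choice).
    """
    delta = delta_scale * np.abs(w).mean()                    # ternarization threshold
    mask = (np.abs(w) > delta).astype(w.dtype)                # non-zero positions
    alpha = (np.abs(w) * mask).sum() / max(mask.sum(), 1.0)   # per-tensor scale
    return alpha * np.sign(w) * mask

w = np.random.randn(4, 4).astype(np.float32)
print(ternarize(w))  # every entry is -alpha, 0, or +alpha: 2 bits of storage each
```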
Deep Learning Inference Frameworks for ARM CPU
The deep learning community focuses on training networks for better accuracy on GPU servers. However, bringing this technology to consumer products requires adapting such networks for inference on low-energy, small-memory, and computationally constrained edge devices. The ARM CPU is one of the important components of edge devices, but a clear comparison between the existing inference frameworks is missing. We provide minimal preliminaries about the ARM CPU architecture and briefly describe the differences between the existing inference frameworks in order to evaluate them based on performance versus usability trade-offs.
Binary Quantizer
One-bit quantization is a general tool for executing a complex model, such as a deep neural network, on a device with limited resources, such as a cell phone. Naively compressing weights into one bit yields an extensive accuracy loss. One-bit models therefore require careful re-training. Here we introduce a class of functions devised to be used as regularizers for re-training one-bit models. Using a regularization function specifically devised for binary quantization avoids ad hoc modifications to the optimization scheme and saves considerable coding effort.
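As an illustration of the general idea (not the specific function class proposed in the paper), a penalty that vanishes exactly at w = ±1 can simply be added to the task loss during re-training; a minimal PyTorch sketch, where the form |1 − |w||^p is an assumed illustrative choice:

```python
import torch

def binary_reg(params, p=2):
    """Penalty that is zero exactly at w = +1 or w = -1 and grows otherwise.

    The form |1 - |w||**p is one simple illustrative choice, not the specific
    class of regularizers introduced in the paper.
    """
    return sum((1.0 - w.abs()).abs().pow(p).sum() for w in params)

model = torch.nn.Linear(10, 4)
task_loss = model(torch.randn(2, 10)).pow(2).mean()    # placeholder task loss
# for brevity this also penalizes biases; in practice one would restrict it to weights
loss = task_loss + 1e-3 * binary_reg(model.parameters())
loss.backward()   # gradients now also push the parameters toward +/-1
```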
Fast high-dimensional Bayesian classification and clustering
We introduce a fast approach to classification and clustering applicable to high-dimensional continuous data, based on Bayesian mixture models for which explicit computations are available. This permits us to treat classification and clustering in a single framework and allows calculation of the probability of belonging to an unobserved class. The new classifier is robust to the addition of noise variables thanks to the built-in spike-and-slab structure of the proposed Bayesian model. The usefulness of classification using our method is shown on a metabolomic example, and on the Iris data with and without noise variables. Agglomerative hierarchical clustering is used to construct a dendrogram based on the posterior probabilities of particular partitions, providing a dendrogram with a probabilistic interpretation. An extension to variable selection is proposed, which summarises the importance of variables for classification or clustering and has a probabilistic interpretation. Having a simple model allows estimation of the model parameters by maximum likelihood and therefore yields a fully automatic algorithm. The new clustering method is applied to metabolomic, microarray, and image data and is studied using simulated data motivated by real datasets. The computational difficulties of the new approach are discussed, solutions for accelerating the algorithm are proposed, and the accompanying computer code is briefly analysed. Simulations show that the quality of the estimated model parameters depends on the parametric distribution assumed for the effects, but after fixing the model parameters to reasonable values, the distribution of the effects influences clustering very little. Simulations also confirm that the clustering algorithm and the proposed variable selection method are reliable when the model assumptions are wrong. The new approach is compared with a popular Bayesian clustering alternative, MCLUST, fitted on the principal components using two loss functions; our proposed approach is found to be more efficient in almost every situation.
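As a rough illustration of classification via explicit class-conditional computations (a generic Gaussian Bayes-rule sketch only, not the spike-and-slab mixture model of the paper), posterior class membership probabilities can be computed directly from class densities and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_class_probs(x, means, covs, priors):
    """Return p(class | x) for Gaussian class-conditional densities.

    Generic Bayes-rule illustration only; the paper's approach uses a
    Bayesian mixture model with a spike-and-slab structure instead.
    """
    likelihoods = np.array([
        multivariate_normal.pdf(x, mean=m, cov=c) for m, c in zip(means, covs)
    ])
    unnormalized = likelihoods * np.asarray(priors)
    return unnormalized / unnormalized.sum()

# two toy 2-D classes with equal priors
probs = posterior_class_probs(
    x=np.array([0.2, -0.1]),
    means=[np.zeros(2), np.ones(2)],
    covs=[np.eye(2), np.eye(2)],
    priors=[0.5, 0.5],
)
print(probs)  # posterior membership probabilities, summing to 1
```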
Activation Adaptation in Neural Networks
Many neural network architectures rely on the choice of the activation function for each hidden layer. Given the activation function, the neural network is trained over the bias and the weight parameters. The bias catches the center of the activation, and the weights capture the scale. Here we propose to train the network over a shape parameter as well. This view allows each neuron to tune its own activation function and adapt its curvature towards a better prediction. The modification adds only one further equation to the back-propagation for each neuron. Re-formalizing activation functions as CDFs generalizes the class of activation functions extensively. We aim to generalize a broad class of activation functions to study: i) skewness and ii) smoothness of activation functions. Here we introduce the adaptive Gumbel activation function as a bridge between Gumbel and sigmoid; a similar approach is used to derive a smooth version of ReLU. Our comparison with common activation functions suggests a different data representation, especially in early neural network layers. This adaptation also provides prediction improvement …
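A minimal sketch of such a trainable-shape activation, assuming the adaptive Gumbel form f(x; s) = 1 − (1 + s·exp(x))^(−1/s), which reduces to the sigmoid at s = 1 and tends to a Gumbel-type CDF as s → 0; the exact parameterization used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGumbel(nn.Module):
    """Activation with one trainable shape parameter per unit.

    Assumed form: f(x; s) = 1 - (1 + s * exp(x)) ** (-1 / s). It equals the
    sigmoid when s = 1 and tends to a Gumbel-type CDF as s -> 0; this is an
    illustrative parameterization, not necessarily the paper's exact one.
    """
    def __init__(self, num_units):
        super().__init__()
        # softplus(0.5413) ~= 1, so training starts close to a plain sigmoid
        self.raw_s = nn.Parameter(torch.full((num_units,), 0.5413))

    def forward(self, x):
        s = F.softplus(self.raw_s) + 1e-6   # keep the shape parameter positive
        return 1.0 - torch.pow(1.0 + s * torch.exp(x), -1.0 / s)

# usage: the shape parameters are learned by back-propagation along with the weights
layer = nn.Sequential(nn.Linear(8, 16), AdaptiveGumbel(16))
out = layer(torch.randn(4, 8))
```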