Batch Normalization Preconditioning for Neural Network Training
Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve the generalization performance of neural networks. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this work, we propose a new method called Batch Normalization Preconditioning (BNP). Instead of applying normalization explicitly through a batch normalization layer as is done in BN, BNP applies normalization by conditioning the parameter gradients directly during training. This is designed to improve the Hessian matrix of the loss function and hence convergence during training. One benefit is that BNP is not constrained by the mini-batch size and works in the online learning setting. We also extend this technique to Bayesian neural networks, which are networks that have probability distributions over the weights and biases instead of single fixed values. In particular, we apply BNP to stochastic gradient Langevin dynamics (SGLD), which is a standard sampling technique for uncertainty estimation in Bayesian neural networks.
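The core idea, normalizing through the gradients rather than through an explicit BN layer, can be sketched as below. The function name and the exact per-feature scaling are illustrative assumptions based on the abstract, not the paper's precise preconditioner:

```python
import numpy as np

def precondition_grad(grad_W, X, eps=1e-5):
    """Rescale a weight gradient by mini-batch input statistics.

    Toy sketch of the BNP idea: instead of normalizing activations
    with a BN layer, divide each column of the weight gradient by the
    (regularized) standard deviation of the corresponding input
    feature, computed from the current mini-batch X.
    """
    var = X.var(axis=0)               # per-feature variance over the batch
    scale = 1.0 / np.sqrt(var + eps)  # BN-style inverse standard deviation
    return grad_W * scale             # broadcasts over the output dimension

# grad_W: (out_dim, in_dim); X: (batch, in_dim) layer inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4)) * np.array([1.0, 10.0, 0.1, 1.0])
grad_W = np.ones((3, 4))
g = precondition_grad(grad_W, X)
```

Columns fed by high-variance inputs get damped, which mimics the conditioning effect a BN layer has on the loss Hessian, while never touching the forward pass, so nothing depends on the mini-batch size at inference.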
Inherent Weight Normalization in Stochastic Neural Networks
Multiplicative stochasticity such as Dropout improves the robustness and
generalizability of deep neural networks. Here, we further demonstrate that
always-on multiplicative stochasticity combined with simple threshold neurons
is sufficient for deep neural networks. We call such models Neural
Sampling Machines (NSM). We find that the probability of activation of the NSM
exhibits a self-normalizing property that mirrors Weight Normalization, a
previously studied mechanism that fulfills many of the features of Batch
Normalization in an online fashion. The normalization of activities during
training speeds up convergence by preventing internal covariate shift caused by
changes in the input distribution. The always-on stochasticity of the NSM
confers the following advantages: the network is identical in the inference and
learning phases, making the NSM suitable for online learning; it can exploit
stochasticity inherent to a physical substrate, such as analog non-volatile
memories for in-memory computing; and it is suitable for Monte Carlo sampling,
while requiring almost exclusively addition and comparison operations. We
demonstrate NSMs on standard classification benchmarks (MNIST and CIFAR) and
event-based classification benchmarks (N-MNIST and DVS Gestures). Our results
show that NSMs perform comparably or better than conventional artificial neural
networks with the same architecture.
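A single NSM-style layer, always-on multiplicative noise followed by a threshold neuron, might look like the following. The names and the exact placement of the noise are illustrative assumptions drawn from the abstract, not the paper's definition:

```python
import numpy as np

def nsm_layer(x, W, p=0.5, rng=None):
    """One stochastic layer in the spirit of a Neural Sampling Machine.

    Always-on multiplicative Bernoulli noise on the weights, followed by
    a simple threshold (step) neuron. Requiring only multiplications by
    a binary mask, additions, and a comparison.
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(W.shape) < p    # multiplicative stochasticity (Bernoulli)
    pre = (mask * W) @ x              # noisy pre-activation
    return (pre > 0).astype(float)    # threshold neuron: fire or not

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))           # 8 neurons, 4 inputs
x = rng.normal(size=4)
out = nsm_layer(x, W, rng=rng)
```

Because the same stochastic forward pass is used for both learning and inference, repeated calls double as Monte Carlo samples of the activation probability.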
SPOC learner's final grade prediction based on a novel sampling batch normalization embedded neural network method
Recent years have witnessed the rapid growth of Small Private Online Courses
(SPOC), which can be highly customized and personalized to adapt to varying
educational needs; machine learning techniques have been explored to summarize
and predict learners' performance, mostly focusing on the final grade. However,
the final grades of SPOC learners are generally severely imbalanced, which
handicaps the training of prediction models. To solve this problem, a sampling
batch normalization embedded deep neural network (SBNEDNN) method is developed
in this paper. First, a combined indicator is defined to measure the
distribution of the data, and a rule is then established to guide the sampling
process. Second, batch normalization (BN) modified layers are embedded into a
fully connected neural network to solve the data imbalance problem.
Experimental comparisons with three other deep learning methods demonstrate the
superiority of the proposed method. Comment: 11 pages, 5 figures, ICAIS 202
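The abstract does not spell out the sampling rule, so the following is only a minimal stand-in showing how class-balanced mini-batch sampling can counter imbalanced final grades; the function name and the equal-per-class rule are hypothetical:

```python
import numpy as np

def balanced_batch(y, batch_size, rng=None):
    """Draw a class-balanced mini-batch of training indices.

    Each grade class contributes the same number of examples, with
    replacement for rare classes, so that heavily imbalanced final
    grades do not dominate any single batch.
    """
    rng = rng or np.random.default_rng()
    classes = np.unique(y)
    per = batch_size // len(classes)  # equal share per grade class
    idx = [rng.choice(np.flatnonzero(y == c), per, replace=True)
           for c in classes]
    return np.concatenate(idx)

y = np.array([0] * 90 + [1] * 8 + [2] * 2)  # heavily imbalanced grades
batch = balanced_batch(y, 30, rng=np.random.default_rng(3))
```

Batches built this way can then be fed to a BN-equipped fully connected network; the batch statistics seen by the BN layers are no longer dominated by the majority grade.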
Improved Dropout for Shallow and Deep Learning
Dropout has achieved great success in training deep neural
networks by independently zeroing out the outputs of neurons at random. It has
also received a surge of interest for shallow learning, e.g., logistic
regression. However, the independent sampling for dropout could be suboptimal
for the sake of convergence. In this paper, we propose to use multinomial
sampling for dropout, i.e., sampling features or neurons according to a
multinomial distribution with different probabilities for different
features/neurons. To exhibit the optimal dropout probabilities, we analyze the
shallow learning with multinomial dropout and establish the risk bound for
stochastic optimization. By minimizing a sampling dependent factor in the risk
bound, we obtain a distribution-dependent dropout with sampling probabilities
dependent on the second order statistics of the data distribution. To tackle
the issue of evolving distribution of neurons in deep learning, we propose an
efficient adaptive dropout (named \textbf{evolutional dropout}) that computes
the sampling probabilities on-the-fly from a mini-batch of examples. Empirical
studies on several benchmark datasets demonstrate that the proposed dropouts
achieve not only much faster convergence but also a smaller testing error
than the standard dropout. For example, on the CIFAR-100 data, the evolutional
dropout achieves relative improvements over 10\% on the prediction performance
and over 50\% on the convergence speed compared to the standard dropout. Comment: In NIPS 201
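The on-the-fly computation described above can be sketched as below; the probability rule (proportional to the root second moment of each feature) follows the abstract, but the rescaling constant and function name are illustrative assumptions:

```python
import numpy as np

def evolutional_dropout(X, k=None, rng=None):
    """Mini-batch-dependent multinomial dropout (evolutional dropout sketch).

    Sampling probabilities are computed on-the-fly from the current
    mini-batch: p_i is proportional to the root second moment of
    feature i, so larger-magnitude features are kept more often. Each
    row keeps k multinomial draws, rescaled by 1/(k * p_i) so that the
    output equals X in expectation.
    """
    rng = rng or np.random.default_rng()
    n, d = X.shape
    k = k or d // 2                          # number of draws per example
    m2 = (X ** 2).mean(axis=0) + 1e-12       # second moment per feature
    p = np.sqrt(m2) / np.sqrt(m2).sum()      # multinomial probabilities
    counts = rng.multinomial(k, p, size=n)   # k feature draws per example
    return X * counts / (k * p)              # unbiased rescaling

rng = np.random.default_rng(4)
X = rng.normal(size=(16, 8))                 # one mini-batch of features
out = evolutional_dropout(X, rng=rng)
```

Recomputing `p` for every mini-batch is what lets the method track the evolving distribution of neuron outputs during deep-network training.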
A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models
Training large vocabulary Neural Network Language Models (NNLMs) is a
difficult task due to the explicit requirement of the output layer
normalization, which typically involves the evaluation of the full softmax
function over the complete vocabulary. This paper proposes a Batch Noise
Contrastive Estimation (B-NCE) approach to alleviate this problem. This is
achieved by reducing the vocabulary, at each time step, to the target words in
the batch and then replacing the softmax by the noise contrastive estimation
approach, where these words play the role of targets and noise samples at the
same time. In doing so, the proposed approach can be fully formulated and
implemented using optimal dense matrix operations. Applying B-NCE to train
different NNLMs on the Large Text Compression Benchmark (LTCB) and the One
Billion Word Benchmark (OBWB) shows a significant reduction of the training
time with no noticeable degradation of the models' performance. This paper also
presents a new baseline comparative study of different standard NNLMs on the
large OBWB on a single Titan-X GPU. Comment: Accepted for publication at INTERSPEECH'1
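The vocabulary-reduction step described above can be sketched with a dense matmul over only the batch's unique target words; this shows the restricted scoring, not the paper's exact NCE loss, and the function name is an assumption:

```python
import numpy as np

def batch_nce_logits(H, E, targets):
    """Score only the words that occur as targets in the current batch.

    Instead of a full softmax over the vocabulary, the output layer is
    restricted to the unique target words of the batch: each batch word
    is the positive target for its own position and serves as a noise
    sample for every other position. Implemented as one dense matmul.
    """
    vocab_ids, pos = np.unique(targets, return_inverse=True)
    logits = H @ E[vocab_ids].T        # (batch, n_unique) dense scores
    return logits, pos                 # pos[i] indexes position i's true word

rng = np.random.default_rng(2)
H = rng.normal(size=(6, 16))           # hidden states for 6 time steps
E = rng.normal(size=(1000, 16))        # output embeddings, vocab = 1000
targets = np.array([5, 17, 5, 42, 17, 99])
logits, pos = batch_nce_logits(H, E, targets)
```

The score matrix here is 6 x 4 instead of 6 x 1000, which is where the training-time reduction comes from: the cost of the output layer scales with the batch's unique targets rather than the full vocabulary.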
- …