Nia, Vahid Partovi

Sari, Eyyüb

Waterloo Library Journal Publishing Service (University of Waterloo, Canada)

Understanding BatchNorm in Ternary Training

Eyyüb Sari, Huawei Noah’s Ark Lab, QC, Canada
Vahid Partovi Nia, Huawei Noah’s Ark Lab, QC, Canada

Abstract

Neural networks comprise two components: weights and activation functions. Ternary weight neural networks (TNNs) achieve good performance and offer up to a 16x compression ratio. TNNs are difficult to train without BatchNorm, and no study has yet clarified the role of BatchNorm in a ternary network. Building on a study of binary networks, we show how BatchNorm helps resolve the exploding gradients issue.

1 Introduction

Compression of Deep Neural Networks (DNNs) is crucial for deploying these huge and energy-hungry models on edge devices. Quantization methods are a set of techniques that target reduced bit-precision representations. Two well-known extreme cases are binary and ternary networks, which allow up to 32x and 16x compression rates, respectively. In contrast to binary weights $\{-1, +1\}$, ternary weights $\{-1, 0, +1\}$ can represent 0. This scheme provides greater flexibility because discarding a value becomes a built-in operation, which is especially helpful for preserving accuracy in the presence of point-wise and depth-wise convolutions. The role of BatchNorm in binary networks with binary activations has already been studied in [1], and [2] reports that BatchNorm helps in training binary and ternary networks. Since ternary networks with full-precision activations are very different models, we ask whether BatchNorm plays a similar role in ternary networks.

2 Notation

Ternary neural networks (TNNs) use full-precision weights during training, which are ternarized during forward propagation. The full-precision weights act as latent parameters and allow for incremental updates. Let $x \in \mathbb{R}$. Given a threshold $\Delta$, we define the ternary function as

$$
\mathrm{tern}(x) =
\begin{cases}
-1 & \text{if } x < -\Delta \\
+1 & \text{if } x > \Delta \\
0 & \text{if } -\Delta \le x \le \Delta
\end{cases}
\tag{1}
$$

For a weight $w$ to which we apply $\mathrm{tern}(w)$ during forward propagation, we define its gradient with respect to a loss function $\mathcal{L}(\cdot)$ as $\left.\frac{\partial \mathcal{L}}{\partial w}\right|_{w=\mathrm{tern}(w)}$.
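The ternary function (1) and the latent-weight update can be illustrated with a minimal sketch. The squared-error loss, the toy values, and the helper names `forward` and `latent_update` are assumptions made for illustration only; they do not come from the paper.

```python
def tern(x, delta):
    """Ternary function of equation (1)."""
    if x < -delta:
        return -1
    if x > delta:
        return 1
    return 0


def forward(latent, inputs, delta):
    """Dot product computed with the ternarized weights."""
    return sum(tern(w, delta) * x for w, x in zip(latent, inputs))


def latent_update(latent, inputs, target, delta, lr):
    """One SGD step on a toy loss 0.5 * (s - target)**2 (hypothetical loss).

    The gradient is evaluated at w = tern(w), but the update is applied
    to the full-precision latent weights.
    """
    err = forward(latent, inputs, delta) - target
    grads = [err * x for x in inputs]  # dL/dw evaluated at the ternarized weights
    return [w - lr * g for w, g in zip(latent, grads)]


# Toy values chosen for illustration only.
delta = 0.1
latent = [-0.8, -0.05, 0.02, 0.4]
inputs = [1.0, 2.0, -1.0, 0.5]

before = [tern(w, delta) for w in latent]
updated = latent_update(latent, inputs, target=0.5, delta=delta, lr=0.1)
after = [tern(w, delta) for w in updated]
print(before, after)
```

In this toy run the second latent weight crosses the threshold after a single step, so its ternary value changes even though the update itself is small and made in full precision.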
The gradient is evaluated at the ternarized weight but accumulated in full precision. At the initialization step, however, weights are drawn from a given random number generator. Applying the ternary function to them can therefore be seen as applying a transformation to a random variable, which yields $\tilde{w}_t$:

$$
\mathrm{tern}(w) = \tilde{w}_t =
\begin{cases}
-1 & \text{with } p_1 = P(w < -\Delta) \\
+1 & \text{with } p_2 = P(w > \Delta) \\
0 & \text{with } p_3 = 1 - p_1 - p_2
\end{cases}
\tag{2}
$$

This setting is very similar to the binary setting of [1] for $\Delta = 0$, or equivalently $p_3 = 0$. This property lets us use the result of [1] and generalize it to the ternary case.

Following initialization schemes such as [3] or [4], weights are drawn from distributions symmetric about zero (e.g. uniform). Therefore $p_1 = p_2$, and we obtain the following identities for the expectation and variance of the transformed random variable:

$$
\mathbb{E}[\tilde{w}_t] = 0, \qquad \mathbb{V}(\tilde{w}_t) = 2p_1.
\tag{3}
$$

The variance of the initial full-precision random variable plays a crucial role in the variance of the ternary random variable, and this is where this study differs from [1]. We follow the notation of [1] to save space. Denote by $s^{l}_{b} \in \mathbb{R}^{K_l}$ the dot product of batch sample $b$ in layer $l$ of the network, which has $K_l$ neurons. Define $f$ to be the element-wise activation function, $x_b$ to be the input, and $W^{l} \in \mathbb{R}^{K_{l-1} \times K_l}$, with elements $W^{l} = [w^{l}_{kk'}]$, to be the weight matrix. We use $w^{l}[i,j]$ to refer to a single element of $W^{l}$; since all elements are i.i.d., we drop the index $[i,j]$ and simply write $w^{l}$. Then

$$
\frac{\partial \mathcal{L}}{\partial s^{l}_{bk}} = f'(s^{l}_{bk}) \sum_{k'=1}^{K_{l+1}} w^{l+1}_{kk'} \frac{\partial \mathcal{L}}{\partial s^{l+1}_{bk'}},
\tag{4}
$$

$$
\frac{\partial \mathcal{L}}{\partial w^{l}_{k'k}} = \sum_{b=1}^{B} s^{l-1}_{bk'} \frac{\partial \mathcal{L}}{\partial s^{l}_{bk}}.
\tag{5}
$$

For the details of the development, see [1].

Assume that the feature element $x$ and the weight element $w$ are centred and i.i.d. Let $k$ denote the current neuron and $k'$ denote a neuron in the previous or the next layer.
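The identities in (3) can be checked by simulation. The following is a minimal sketch that assumes weights drawn from a uniform distribution $U(-a, a)$; the values of `a`, `delta`, the sample size, and the seed are arbitrary choices for illustration.

```python
import random


def tern(x, delta):
    """Ternary function of equation (1)."""
    if x < -delta:
        return -1
    if x > delta:
        return 1
    return 0


def ternary_moments(a, delta, n=200_000, seed=0):
    """Empirical mean and variance of tern(w) for w ~ U(-a, a)."""
    rng = random.Random(seed)
    samples = [tern(rng.uniform(-a, a), delta) for _ in range(n)]
    mean = sum(samples) / n
    var = sum((t - mean) ** 2 for t in samples) / n
    return mean, var


a, delta = 1.0, 0.3                # illustrative values, not from the paper
p1 = 0.5 - delta / (2.0 * a)       # P(w < -delta) when w ~ U(-a, a)
mean, var = ternary_moments(a, delta)
print(f"empirical mean {mean:.4f}, empirical variance {var:.4f}, 2*p1 = {2 * p1:.4f}")
```

The empirical mean should be close to 0 and the empirical variance close to $2p_1$, up to Monte Carlo error.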
For the ReLU activation function, one can show that

$$
\mathbb{V}(s^{l}_{bk}) = \mathbb{V}(x) \prod_{l'=1}^{l-1} \frac{1}{2} K_{l'} \mathbb{V}(w^{l'}),
$$

where $\mathbb{V}(w^{l'})$ is the variance of the weight in layer $l'$, with $w$ drawn from a uniform distribution symmetric about zero.

Applying the same mathematical machinery as [1], the variance of the gradient for a neuron is

$$
\mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{l}_{bk}}\right) = \mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{L}}\right) \prod_{l'=l+1}^{L} \frac{1}{2} K_{l'} \mathbb{V}(w^{l'}),
\tag{6}
$$

which explodes or vanishes depending on $\mathbb{V}(w^{l'})$. This is the main reason common full-precision initialization methods suggest $\mathbb{V}(w^{l}) = \frac{2}{K_l}$.

3 BatchNorm Role

To control the variance across layers during backpropagation, we need $\mathbb{V}(w^{l}) = \frac{2}{K_l}$. In the ternary weight case, we therefore need $\mathbb{V}(\tilde{w}^{l}_{t}) = 2p_1 = \frac{2}{K_l}$, that is, $p_1 = \frac{1}{K_l}$. The common initialization [4] draws $w^{l} \sim U\!\left(-\sqrt{\tfrac{6}{K_l}}, \sqrt{\tfrac{6}{K_l}}\right)$, so that

$$
P(w^{l} < -\Delta) = p_1 = \frac{1}{2} - \frac{\Delta}{2\sqrt{\frac{6}{K_l}}}.
\tag{7}
$$

For (7) to give $p_1 = \frac{1}{K_l}$, the threshold has to be set to

$$
\Delta = 2\sqrt{\frac{6}{K_l}} \left(\frac{1}{2} - \frac{1}{K_l}\right).
\tag{8}
$$

In real-world settings, e.g. for a convolutional layer with a 3×3 kernel and 128 filters, $p_1 \approx 8 \times 10^{-4}$. The threshold would therefore be set so that more than 99% of the weights are zero in order to control the variance. The major drawback is that learning becomes impossible in this case, because most of the weights are set to zero. Suppose instead that the threshold is chosen so that learning is feasible, for instance so that fewer than 50% of the ternary weights are set to zero. Then, for any given $\Delta$,

$$
\mathbb{V}(\tilde{w}^{l}_{t}) = 2p_1 = 1 - \frac{\Delta}{\sqrt{\frac{6}{K_l}}}.
\tag{9}
$$

In the literature, [5] suggests setting $\Delta^{l} = 0.7\,\mathbb{E}[|w^{l}|]$. Under common initialization schemes this gives

$$
\Delta^{l} = \frac{0.7}{2} \sqrt{\frac{6}{K_l}},
\tag{10}
$$

and (9) reduces to $\mathbb{V}(\tilde{w}^{l}_{t}) = 1 - \frac{0.7}{2} = 0.65$. In this setting the variance is larger than $\frac{2}{K_l}$, which produces exploding gradients. The situation is similar to the binary case reported in [1], which gives us a reason to take a closer look at BatchNorm in the ternary setting.

Suppose a mini-batch of size $B$ for a given neuron $k$. Let $\hat{\mu}_k$ and $\hat{\sigma}_k$ be the mean and the standard deviation of the dot product $s^{l}_{bk}$, $b = 1, \dots, B$.
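The threshold calculations in (7) to (10) can be reproduced numerically. The sketch below assumes a fan-in of $K_l = 3 \cdot 3 \cdot 128 = 1152$ for the convolutional example, i.e. 128 input channels, which is an assumption about how the example is counted; all function names are hypothetical.

```python
import math


def uniform_bound(fan_in):
    """Half-width sqrt(6 / K_l) of the uniform initialization used in (7)."""
    return math.sqrt(6.0 / fan_in)


def p1(delta, fan_in):
    """Equation (7): P(w < -delta) for w ~ U(-sqrt(6/K), sqrt(6/K))."""
    return 0.5 - delta / (2.0 * uniform_bound(fan_in))


def variance_control_threshold(fan_in):
    """Equation (8): threshold that yields p1 = 1 / K_l."""
    return 2.0 * uniform_bound(fan_in) * (0.5 - 1.0 / fan_in)


def twn_threshold(fan_in):
    """Equation (10): 0.7 * E|w| under the same uniform initialization."""
    return 0.7 / 2.0 * uniform_bound(fan_in)


def ternary_variance(delta, fan_in):
    """Equation (9): variance of the ternarized weight, 2 * p1."""
    return 1.0 - delta / uniform_bound(fan_in)


def zero_fraction(delta, fan_in):
    """Probability p3 = 1 - 2 * p1 that a weight is ternarized to zero."""
    return 1.0 - 2.0 * p1(delta, fan_in)


K = 3 * 3 * 128  # assumed fan-in of the example layer
sparse = variance_control_threshold(K)
twn = twn_threshold(K)

print(f"p1 required for variance control: {p1(sparse, K):.2e}")
print(f"zeros with threshold (8):  {zero_fraction(sparse, K):.4f}")
print(f"variance with threshold (8):  {ternary_variance(sparse, K):.5f} (target 2/K = {2 / K:.5f})")
print(f"zeros with threshold (10): {zero_fraction(twn, K):.4f}")
print(f"variance with threshold (10): {ternary_variance(twn, K):.2f}")
```

Under these assumptions, threshold (8) matches the target variance $2/K_l$ only by zeroing more than 99% of the weights, while threshold (10) zeroes about 35% of them and leaves a variance of 0.65.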
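For a given layer $l$, BatchNorm is defined as $\mathrm{BN}(s_{bk}) \equiv z_{bk} = \gamma_k \hat{s}_{bk} + \beta_k$, where $\hat{s}_{bk} = \frac{s_{bk} - \hat{\mu}_k}{\hat{\sigma}_k}$ is the standardized dot product and the pair $(\gamma_k, \beta_k)$ is trainable and often initialized to $(1, 0)$. Following [1], one can show that

$$
\mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{l}_{bk}}\right) = \left(\frac{\gamma^{l}_{k}}{B \hat{\sigma}^{l}_{k}}\right)^{2} \left\{B^{2} + 2B - 1 + \mathbb{V}\!\left((\hat{s}^{l}_{bk})^{2}\right)\right\} \frac{1}{2} K_{l+1} \mathbb{V}(\tilde{w}^{l+1}_{t})\, \mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{l+1}}\right).
\tag{11}
$$

Under the assumptions of common full-precision initialization [4], i.e. weights and activations are i.i.d. and weights are centred about zero, for a layer $l$

$$
\hat{\sigma}^{2}_{k} = K_{l-1} \frac{1}{2} \mathbb{V}(\hat{s}^{l-1}_{b})\, \mathbb{V}(\tilde{w}^{l}_{t}) = K_{l-1} \frac{1}{2} \mathbb{V}(\tilde{w}^{l}_{t}).
\tag{12}
$$

Therefore, at initialization ($\gamma^{l}_{k} = 1$) and when the ternary weight variance is the same in layers $l$ and $l+1$, (11) reduces to

$$
\mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{l}_{bk}}\right) = \frac{B^{2} + 2B - 1 + \mathbb{V}\!\left((\hat{s}^{l}_{bk})^{2}\right)}{B^{2}} \, \frac{K_{l+1}}{K_{l-1}} \, \mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{l+1}}\right)
\tag{13}
$$

$$
= \left\{1 + o\!\left(\frac{1}{B^{1-\varepsilon}}\right)\right\} \frac{K_{l+1}}{K_{l-1}} \, \mathbb{V}\!\left(\frac{\partial \mathcal{L}}{\partial s^{l+1}}\right).
\tag{14}
$$

Equation (13) confirms a conclusion similar to the binary case: BatchNorm indeed prevents exploding gradients, because the weight variance no longer appears in the recursion.

4 Numerical Experiment

We evaluate four different scenarios on the CIFAR-10 dataset [6]. It contains 50,000 training images and 10,000 test images. Each image is 32×32 pixels with RGB channels. Data augmentation is applied during training. We pad the images with 4 zeros on each side. A random 32×32 crop is then taken from the 36×36 px padded images. Finally, images are flipped horizontally uniformly at random. At both training and test time, the images are normalized with $\mu = (0.4914, 0.4822, 0.4465)$ and $\sigma = (0.247, 0.243, 0.261)$. We use the VGG-7 architecture defined in [5] with BatchNorm, the ReLU activation function and ternary weights. Experiments on ResNet-56 [7] are also performed; its shortcut connections can alleviate the exploding gradient issue to some extent. Lightweight models are harder to train and much more sensitive to instabilities, so we also include a study on MobileNet-v1 that clarifies the effect of exploding gradients. Each model is trained for 150 epochs using the SGD optimizer with momentum set to 0.9 and a starting learning rate of 0.1. The learning rate is divided by 10 at epochs 80 and 120. L2 regularization is applied with $\lambda = 10^{-4}$.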
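The optimization settings described so far can be summarized in a small configuration sketch. The constant names, the 0-indexed epoch convention, and the choice to apply each drop at the start of the milestone epoch are assumptions; the paper does not state which framework or scheduler was used.

```python
# Hyperparameters as stated in the text; the names are hypothetical.
EPOCHS = 150
BASE_LR = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4        # L2 regularization strength (lambda)
MILESTONES = (80, 120)     # epochs at which the learning rate is divided
DECAY_FACTOR = 10.0


def learning_rate(epoch):
    """Step schedule: divide the base rate by 10 at each milestone passed.

    Assumes 0-indexed epochs and that the drop applies from the milestone
    epoch onward.
    """
    drops = sum(epoch >= m for m in MILESTONES)
    return BASE_LR / DECAY_FACTOR ** drops


schedule = [learning_rate(e) for e in range(EPOCHS)]
print(schedule[0], schedule[79], schedule[80], schedule[120], schedule[-1])
```

With these assumptions the rate is 0.1 for epochs 0 to 79, 0.01 for epochs 80 to 119, and 0.001 for the remaining epochs.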
The mini-batch size is 100.

We experiment with four settings for each architecture: i) BatchNorm with the TWN threshold [5] (BN); ii) no BatchNorm, keeping the TWN threshold (No BN); iii) BatchNorm with the variance-control threshold defined in (8) (Sparse BN); and iv) no BatchNorm with the threshold from (8) (Sparse No BN). Setting i) can be regarded as the baseline; it should not be confused with the full-precision model. Setting ii) provides an interesting observation: a relatively shallow model such as VGG-7 still achieves decent accuracy without BatchNorm, because the model is too shallow to show the exploding gradient effect. ResNet-56, on the other hand, is a deeper model and would be expected to suffer from accuracy loss, but it recovers because of its shortcut connections. MobileNet diverges without BatchNorm because the model is deeper than VGG-7 and includes no shortcut connections to compensate for the exploding gradient effect. Settings iii) and iv) confirm that if the thresholds $\Delta^{l}$ are selected to ensure variance control as in (8), most of the weights are ternarized to zero and the models do not converge due to bad initialization. Even in this setting, ResNet-56 is still able to produce better outputs than a random predictor, because information is carried through the shortcut connections.

5 Conclusion

We find that, in theory, gradient explosion can be prevented without BatchNorm by setting a proper threshold for mapping to zero, and these results are backed up by numerical experiments. However, the threshold $\Delta$ that this theory requires sets most of the weights to zero, which in practice does not allow TNNs to converge. BatchNorm, in contrast, prevents gradient explosion independently of the chosen $\Delta$.

| Model | BN | No BN | Sparse BN | Sparse No BN |
|---|---|---|---|---|
| VGG-7 | 93.5 | 78.1 | - | - |
| ResNet-56 | 92.7 | 85.7 | 38.9 | 39.3 |
| MobileNet | 88.3 | - | - | - |

Table 1: Ablation of BatchNorm and threshold on VGG-7, ResNet-56 and MobileNet, maximum accuracy (%) achieved after training.
Columns: BatchNorm and TWN threshold [5] (BN); no BatchNorm, keeping the TWN threshold (No BN); BatchNorm with the variance-control threshold defined in (8) (Sparse BN); and no BatchNorm with the threshold from (8) (Sparse No BN). Results are not reported ("-") if the network did not converge (i.e. was not better than random).

6 Acknowledgement

We would like to thank our Huawei CBG Software Shanghai colleagues Mohan Liu and Li Zhou for their fruitful technical discussions. We also thank Yanhui Geng and Jin Tang for their support throughout the project.

References

[1] Eyyüb Sari, Mouloud Belbahri, and Vahid Partovi Nia. How does batch normalization help binary training? arXiv, abs/1909.09139, 2019.
[2] Arash Ardakani, Zhengyun Ji, Sean C. Smithson, Brett H. Meyer, and Warren J. Gross. Learning recurrent binary/ternary weights. In International Conference on Learning Representations, 2019.
[3] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
[5] Fengfu Li and Bin Liu. Ternary weight networks. arXiv, abs/1605.04711, 2016.
[6] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

https://openjournals.uwaterloo.ca/index.php/vsl/article/download/1646/2015
