This paper first answers the question "why do the two most powerful
techniques, Dropout and Batch Normalization (BN), often lead to worse
performance when they are combined together?" from both theoretical and
statistical aspects. Theoretically, we find that Dropout would shift the
variance of a specific neural unit when we transfer the state of that network
from train to test. However, BN would maintain its statistical variance, which
is accumulated from the entire learning procedure, in the test phase. The
inconsistency of this variance (which we name "variance shift") causes
unstable numerical behavior in inference and eventually leads to more
erroneous predictions when Dropout is applied before BN. Thorough experiments on
DenseNet, ResNet, ResNeXt and Wide ResNet confirm our findings. According to
the uncovered mechanism, we next explore several strategies that modify
Dropout and try to overcome the limitations of their combination by avoiding
the risk of variance shift.
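
As a rough numerical illustration of the variance shift described above (a sketch we add here, not taken from the paper; the keep probability p and the unit-variance Gaussian input are assumed for demonstration), the following NumPy snippet compares the activation variance a BN layer would accumulate behind an inverted-Dropout layer in training mode with the variance it actually encounters at test time.

    import numpy as np

    # Hypothetical sketch: p and the unit-variance input are illustrative assumptions.
    rng = np.random.default_rng(0)
    p = 0.5                                   # Dropout keep probability
    x = rng.normal(0.0, 1.0, size=1_000_000)  # pre-Dropout activations, Var(x) ~ 1

    # Training phase: inverted Dropout keeps each unit with probability p and
    # rescales it by 1/p, so the variance the following BN layer accumulates
    # into its running statistics is roughly Var(x)/p (= 2 here).
    mask = rng.random(x.shape) < p
    train_act = x * mask / p
    print("variance BN accumulates during training:", train_act.var())

    # Test phase: Dropout acts as the identity, so the true variance is Var(x) ~ 1,
    # yet BN still normalizes with the larger running variance from training.
    # This mismatch is the "variance shift" referred to above.
    test_act = x
    print("variance BN actually sees at test time:", test_act.var())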