Structural re-parameterization (SRP) is a deep learning technique that achieves interconversion between different network architectures through equivalent parameter transformations. By applying these transformations at inference time, SRP removes the extra costs, such as parameter size and inference time, that performance-improving structures incur during training, and it therefore has great potential for industrial and practical applications. Existing SRP methods have successfully covered many commonly used architectures, such as normalization layers, pooling methods, and multi-branch convolutions.
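To make the idea of an equivalent parameter transformation concrete, the following is a minimal sketch, assuming a PyTorch implementation, of one transformation commonly used in SRP: folding a BatchNorm layer into the preceding convolution. The function name fuse_conv_bn is illustrative and not taken from the paper.

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
        # Build a single convolution equivalent to conv followed by bn (eval mode).
        fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                          conv.stride, conv.padding, conv.dilation, conv.groups,
                          bias=True)
        # Per-output-channel scale: gamma / sqrt(running_var + eps).
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
        bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.data = (bias - bn.running_mean) * scale + bn.bias
        return fused

    # The fused convolution is numerically identical to conv + bn at inference.
    conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
    bn.eval()
    x = torch.randn(2, 8, 32, 32)
    assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)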
However, the widely used self-attention modules cannot be directly implemented by SRP, because these modules usually act on the backbone network in a multiplicative manner and their outputs are input-dependent during inference, which limits the application scenarios of SRP. In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon, termed the Stripe Observation, which reveals that channel attention values quickly approach certain constant vectors during training.
This observation inspires us to propose a simple yet effective attention-alike structural re-parameterization (ASR) method that achieves SRP for a given network while enjoying the effectiveness of the self-attention mechanism.
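To illustrate why a converged attention vector admits re-parameterization, here is another minimal sketch, again assuming PyTorch: once a multiplicative channel attention has settled to a constant vector c, as the Stripe Observation suggests, it can be folded into the convolution it scales. The name fold_constant_attention and the vector c are hypothetical illustrations, not the paper's implementation.

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def fold_constant_attention(conv: nn.Conv2d, c: torch.Tensor) -> nn.Conv2d:
        # Build a convolution equivalent to c.view(1, -1, 1, 1) * conv(x),
        # valid only when the attention vector c is constant across inputs.
        fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                          conv.stride, conv.padding, conv.dilation, conv.groups,
                          bias=conv.bias is not None)
        fused.weight.data = conv.weight.data * c.reshape(-1, 1, 1, 1)
        if conv.bias is not None:
            fused.bias.data = conv.bias.data * c
        return fused

    conv = nn.Conv2d(8, 16, 3, padding=1)
    c = torch.rand(16)  # hypothetical converged channel attention values
    x = torch.randn(2, 8, 32, 32)
    assert torch.allclose(c.view(1, -1, 1, 1) * conv(x),
                          fold_constant_attention(conv, c)(x), atol=1e-5)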
Extensive experiments conducted on several standard benchmarks demonstrate the effectiveness of ASR in generally improving the performance of existing backbone networks, self-attention modules, and SRP methods without any elaborate model crafting. We also analyze the limitations of ASR and provide experimental or theoretical evidence for its strong robustness.