The rapid advances in Vision Transformer (ViT) refresh the state-of-the-art
performances in various vision tasks, overshadowing the conventional CNN-based
models. This ignites a few recent striking-back research in the CNN world
showing that pure CNN models can achieve as good performance as ViT models when
carefully tuned. While encouraging, designing such high-performance CNN models
is challenging, requiring non-trivial prior knowledge of network design. To
this end, a novel framework termed Mathematical Architecture Design for Deep
CNN (DeepMAD) is proposed to design high-performance CNN models in a principled
way. In DeepMAD, a CNN network is modeled as an information processing system
whose expressiveness and effectiveness can be analytically formulated by their
structural parameters. Then a constrained mathematical programming (MP) problem
is proposed to optimize these structural parameters. The MP problem can be
easily solved by off-the-shelf MP solvers on CPUs with a small memory
footprint. In addition, DeepMAD is a pure mathematical framework: no GPU or
training data is required during network design. The superiority of DeepMAD is
validated on multiple large-scale computer vision benchmark datasets. Notably
on ImageNet-1k, only using conventional convolutional layers, DeepMAD achieves
0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin on Tiny level, and
0.8% and 0.9% higher on Small level.Comment: Accepted by CVPR 202