Since the recent success of Vision Transformers (ViTs), explorations of
transformer-style architectures have triggered the resurgence of modern
ConvNets. In this work, we explore the representation ability of DNNs through
the lens of interaction complexities. We empirically show that interaction
complexity is an overlooked but essential indicator for visual recognition.
Accordingly, a new family of efficient ConvNets, named MogaNet, is presented to
pursue informative context mining in pure ConvNet-based models, with favorable
complexity-performance trade-offs. In MogaNet, interactions across multiple
complexities are facilitated and contextualized by leveraging two specially
designed aggregation blocks in both spatial and channel interaction spaces.
Extensive studies are conducted on ImageNet classification, COCO object
detection, and ADE20K semantic segmentation tasks. The results demonstrate that
our MogaNet establishes new state-of-the-art results over other popular methods
across mainstream scenarios and model scales. Notably, the lightweight MogaNet-T
achieves 80.0\% top-1 accuracy with only 1.44G FLOPs using a refined training
setup on ImageNet-1K, surpassing ParC-Net-S by 1.4\% accuracy but saving 59\%
(2.04G) FLOPs.
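To make the abstract's description of the two aggregation blocks concrete, below is a minimal PyTorch-style sketch of how a spatial aggregation stage and a channel aggregation stage could be paired in one block. All module names (SpatialAggregation, ChannelAggregation, Block) and their internals (a gated depth-wise convolution, a point-wise MLP) are illustrative assumptions, not the paper's actual MogaNet implementation.

# Illustrative sketch only: the structure below assumes a gated depth-wise
# convolution for spatial aggregation and a point-wise MLP for channel
# aggregation; it is not the authors' implementation.
import torch
import torch.nn as nn

class SpatialAggregation(nn.Module):
    """Gated depth-wise convolution over spatial positions (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)
        self.context = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        # Modulate depth-wise aggregated context with a point-wise gate.
        return self.proj(torch.sigmoid(self.gate(x)) * self.context(x))

class ChannelAggregation(nn.Module):
    """Point-wise MLP that mixes and reweights channels (assumed design)."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class Block(nn.Module):
    """One stage: spatial then channel aggregation, each with a residual path."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        self.spatial = SpatialAggregation(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.channel = ChannelAggregation(dim)

    def forward(self, x):
        x = x + self.spatial(self.norm1(x))
        x = x + self.channel(self.norm2(x))
        return x

x = torch.randn(1, 64, 56, 56)
print(Block(64)(x).shape)  # torch.Size([1, 64, 56, 56])

The sketch only illustrates the stated division of labor, i.e. one stage operating in the spatial interaction space and one in the channel interaction space within a single ConvNet block; the paper's concrete operators should be taken from its method section and official code.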