Detecting small scene text instances in the wild is particularly challenging,
where irregular positions and non-ideal lighting often lead to
detection errors. We present MixNet, a hybrid architecture that combines the
strengths of CNNs and Transformers, capable of accurately detecting small text
from challenging natural scenes, regardless of orientation, style, and
lighting conditions. MixNet incorporates two key modules: (1) the Feature
Shuffle Network (FSNet) to serve as the backbone and (2) the Central
Transformer Block (CTBlock) to exploit the 1D manifold constraint of the scene
text. We first introduce a novel feature shuffling strategy in FSNet to
facilitate the exchange of features across multiple scales, generating
high-resolution features superior to popular ResNet and HRNet. The FSNet
backbone has achieved significant improvements over many existing text
detection methods, including PAN, DB, and FAST. Then we design a complementary
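As a rough, hypothetical illustration of such cross-scale feature exchange (not the authors' FSNet code), the following PyTorch sketch splits each pyramid level's channels into chunks and redistributes one chunk from every level to every resolution; the function name, chunking scheme, and channel counts are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def shuffle_multiscale_features(feats):
    """Illustrative shuffle (assumed, not FSNet itself): split each scale's
    channels into len(feats) chunks, then rebuild every scale from one chunk
    per input scale, resampled to that scale's resolution, so each output
    level carries channels mixed from all pyramid levels."""
    n = len(feats)
    chunks = [torch.chunk(f, n, dim=1) for f in feats]  # per-level channel split
    outputs = []
    for i, ref in enumerate(feats):
        h, w = ref.shape[-2:]
        mixed = [
            F.interpolate(chunks[j][i], size=(h, w),
                          mode="bilinear", align_corners=False)
            for j in range(n)
        ]
        outputs.append(torch.cat(mixed, dim=1))
    return outputs

# four pyramid levels at typical 1/4, 1/8, 1/16, 1/32 resolutions (assumed sizes)
feats = [torch.randn(1, 64, 160, 160), torch.randn(1, 128, 80, 80),
         torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)]
for f in shuffle_multiscale_features(feats):
    print(f.shape)  # every level ends up with (64+128+256+512)/4 = 240 channels
```

In an actual backbone this exchange would be interleaved with learned convolutions and fusion layers; the sketch only shows the channel-shuffling idea across scales.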
Then we design a complementary CTBlock to leverage center-line-based features,
akin to the medial axis of text regions, and show that it can outperform
contour-based approaches in challenging cases where small text instances appear
close to one another.
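To illustrate why a center line naturally fits a 1D sequence model, the sketch below skeletonizes a binary text mask, samples an ordered set of center-line points, and feeds them to a generic Transformer encoder; the skeletonization, point ordering, and encoder configuration are assumptions for illustration, not the CTBlock design.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.morphology import skeletonize

def centerline_sequence(text_mask, num_points=32):
    """Reduce a binary text-region mask to an ordered 1D sequence of
    center-line points, a rough stand-in for the region's medial axis."""
    skel = skeletonize(text_mask.astype(bool))
    ys, xs = np.nonzero(skel)
    order = np.argsort(xs)                              # crude left-to-right ordering
    pts = np.stack([xs[order], ys[order]], 1).astype(np.float32)
    idx = np.linspace(0, len(pts) - 1, num_points).astype(int)
    return torch.from_numpy(pts[idx])                   # (num_points, 2)

# a thin, elongated "text" region; its center line is a 1D curve
mask = np.zeros((64, 256), dtype=np.uint8)
mask[28:36, 20:230] = 1

seq = centerline_sequence(mask).unsqueeze(0)            # (1, 32, 2)
embed = nn.Linear(2, 64)                                # lift (x, y) points to tokens
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
print(encoder(embed(seq)).shape)                        # (1, 32, 64)
```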
Extensive experimental results show that MixNet, which mixes FSNet with
CTBlock, achieves state-of-the-art results on multiple scene text detection datasets.