Leveraging the characteristics of convolutional layers, neural networks are
extremely effective for pattern recognition tasks. However, in some cases their
decisions are based on unintended information, leading to high performance on
standard benchmarks but also to a lack of generalization under challenging testing
conditions and unintuitive failures. Recent work has termed this "shortcut
learning" and addressed its presence in multiple domains. In text recognition,
we reveal another such shortcut, whereby recognizers overly depend on local
image statistics. Motivated by this, we suggest an approach that regulates the
reliance on local statistics and thereby improves text recognition performance.
Our method, termed TextAdaIN, creates local distortions in the feature map
that prevent the network from overfitting to local statistics. It does so by
viewing each feature map as a sequence of elements and deliberately mismatching
fine-grained feature statistics between elements in a mini-batch (a code sketch
of this idea follows the abstract). Despite
TextAdaIN's simplicity, extensive experiments show its effectiveness compared
to other, more complicated methods. TextAdaIN achieves state-of-the-art results
on standard handwritten text recognition benchmarks. It generalizes to multiple
architectures and to the domain of scene text recognition. Furthermore, we
demonstrate that integrating TextAdaIN improves robustness to more
challenging testing conditions. The official PyTorch implementation can be
found at https://github.com/amazon-research/textadain-robust-recognition.
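To make the mechanism concrete, below is a minimal PyTorch sketch of a
TextAdaIN-style operation: the width axis of a feature map is treated as the
sequence dimension, split into windows ("elements"), and each window's
per-channel statistics are swapped with those of the matching window from
another sample in the mini-batch. This is an illustration under stated
assumptions, not the authors' exact implementation; the function name
textadain_sketch and the defaults for the window count k and application
probability p are hypothetical choices made for clarity (see the repository
above for the real code).

import torch

def textadain_sketch(x, k=5, p=0.5, eps=1e-6):
    # x: (B, C, H, W) feature map; W is treated as the sequence axis.
    # With probability p (training only), split W into k windows and, per
    # window, replace each sample's per-channel mean/std with those of the
    # corresponding window of a randomly permuted sample in the batch.
    # k, p, and the function name are illustrative assumptions.
    if torch.rand(1).item() > p:
        return x
    b, c, h, w = x.shape
    perm = torch.randperm(b, device=x.device)  # donor index for each sample
    bounds = torch.linspace(0, w, k + 1).long().tolist()
    out = x.clone()
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if hi <= lo:
            continue
        win = x[:, :, :, lo:hi]
        mu = win.mean(dim=(2, 3), keepdim=True)          # (B, C, 1, 1)
        sigma = win.std(dim=(2, 3), keepdim=True) + eps  # (B, C, 1, 1)
        # AdaIN with deliberately mismatched statistics: normalize each
        # window with its own stats, re-style it with the donor's stats.
        out[:, :, :, lo:hi] = (win - mu) / sigma * sigma[perm] + mu[perm]
    return out

In practice such an operation would be applied inside selected layers of the
recognizer during training and act as the identity at evaluation; consult the
official repository for the exact placement and hyperparameters.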