While the use of bottom-up local operators in convolutional neural networks
(CNNs) matches well some of the statistics of natural images, it may also
prevent such models from capturing contextual long-range feature interactions.
In this work, we propose a simple, lightweight approach for better context
exploitation in CNNs. We do so by introducing a pair of operators: gather,
which efficiently aggregates feature responses from a large spatial extent, and
excite, which redistributes the pooled information to local features. The
operators are cheap, both in terms of number of added parameters and
computational complexity, and can be integrated directly in existing
architectures to improve their performance. Experiments on several datasets
show that gather-excite can bring benefits comparable to increasing the depth
of a CNN at a fraction of the cost. For example, we find ResNet-50 with
gather-excite operators is able to outperform its 101-layer counterpart on
ImageNet with no additional learnable parameters. We also propose a parametric
gather-excite operator pair which yields further performance gains, relate it
to the recently-introduced Squeeze-and-Excitation Networks, and analyse the
effects of these changes to the CNN feature activation statistics.Comment: NeurIPS 201