Convolutional neural networks (CNNs) have shown great success in computer
vision, approaching human-level performance when trained for specific tasks via
application-specific loss functions. In this paper, we propose a method for
augmenting and training CNNs so that their learned features are compositional.
It encourages networks to form representations that disentangle objects from
their surroundings and from each other, thereby promoting better
generalization. Our method is agnostic to the specific details of the
underlying CNN to which it is applied and can in principle be used with any
CNN. As we show in our experiments, the learned representations lead to feature
activations that are more localized and improve performance over
non-compositional baselines in object recognition tasks.Comment: Preprint appearing in CVPR 201