Following the traditional paradigm of convolutional neural networks (CNNs),
modern CNNs manage to keep pace with more recent, for example
transformer-based, models by not only increasing model depth and width but also
the kernel size. This results in large amounts of learnable model parameters
that need to be handled during training. While following the convolutional
paradigm with the according spatial inductive bias, we question the
significance of \emph{learned} convolution filters. In fact, our findings
demonstrate that many contemporary CNN architectures can achieve high test
accuracies without ever updating randomly initialized (spatial) convolution
filters. Instead, simple linear combinations (implemented through efficient
1×1 convolutions) suffice to effectively recombine even random filters
into expressive network operators. Furthermore, these combinations of random
filters can implicitly regularize the resulting operations, mitigating
overfitting and enhancing overall performance and robustness. Conversely,
retaining the ability to learn filter updates can impair network performance.
Lastly, although we only observe relatively small gains from learning 3×3 convolutions, the learning gains increase proportionally with kernel size,
owing to the non-idealities of the independent and identically distributed
(\textit{i.i.d.}) nature of default initialization techniques