The well-documented presence of texture bias in modern convolutional neural
networks has led to a plethora of algorithms that promote an emphasis on shape
cues, often to support generalization to new domains. Yet, common datasets,
benchmarks and general model selection strategies are missing, and there is no
agreed-upon, rigorous evaluation protocol. In this paper, we investigate
difficulties and limitations when training networks with reduced texture bias.
In particular, we show that proper evaluation and meaningful comparisons
between methods are not trivial. We introduce BiasBed, a testbed for texture-
and style-biased training, including multiple datasets and a range of existing
algorithms. It comes with an extensive evaluation protocol that includes
rigorous hypothesis testing to gauge the significance of the results, despite
the considerable training instability of some style bias methods. Our extensive
experiments shed new light on the need for careful, statistically sound
evaluation protocols for style bias (and beyond). For example, we find that some
algorithms proposed in the literature do not significantly mitigate the impact
of style bias at all. With the release of BiasBed, we hope to foster a common
understanding of consistent and meaningful comparisons and, consequently, faster
progress towards learning methods free of texture bias. Code is available at
https://github.com/D1noFuzi/BiasBed.