Convolutional Neural Networks (CNNs) in medical imaging have often outperformed previous solutions and even specialists, making them a promising technology for Computer-Aided Diagnosis (CAD) systems. However, recent work suggests that CNNs may generalise poorly to new data, for instance data generated in different hospitals. Uncontrolled confounders have been proposed as a common reason. In this paper, we experimentally demonstrate the impact of confounding data in previously unseen scenarios. We assessed the effect of four confounding configurations: total, strong, light and balanced. We found that the confounding effect is especially prominent in the total confounding scenario, whereas its effect in the light and strong confounding scenarios may depend on dataset robustness. Our findings indicate that the confounding effect is independent of the architecture employed.
These findings might explain why models can report good metrics during the development stage but fail to translate to real-world settings. We highlight the need for thorough consideration of these commonly overlooked aspects in order to develop safer CNN-based CAD systems.