Machine Learning (ML) has emerged as a promising approach in healthcare,
outperforming traditional statistical techniques. However, to establish ML as a
reliable tool in clinical practice, adherence to best practices regarding data
handling, experimental design, and model evaluation is crucial. This work
summarizes and strictly observes such practices to ensure reproducible and
reliable ML. Specifically, we focus on Alzheimer's Disease (AD) detection,
which serves as a paradigmatic example of a challenging problem in healthcare. We
investigate the impact of different data augmentation techniques and model
complexity on the overall performance. We consider MRI data from the ADNI dataset
to address a classification problem employing a 3D Convolutional Neural Network
(CNN). The experiments are designed to compensate for data scarcity and random
parameter initialization by utilizing cross-validation and multiple training trials.
Within this framework, we train 15 predictive models, considering three
different data augmentation strategies and five distinct 3D CNN architectures,
each varying in the number of convolutional layers. Specifically, the
augmentation strategies are based on affine transformations, such as zoom,
shift, and rotation, applied concurrently or separately. The combined effect of
data augmentation and model complexity leads to a variation in prediction
accuracy of up to 10%. When the affine transformations are applied
separately, the model is more accurate, regardless of the adopted
architecture. For all strategies, the model accuracy follows a concave
trend as the number of convolutional layers increases, peaking at an
intermediate number of layers. The best model (8 convolutional layers, augmentation strategy (B)) is the most stable
across cross-validation folds and training trials, reaching excellent
performance both on the testing set and on an external test set.
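
The experimental design summarized above (three augmentation strategies crossed with five architectures, each evaluated over cross-validation folds and repeated training trials) can be illustrated with a minimal sketch. This only enumerates the runs and is not the authors' implementation: the strategy labels, the layer counts, the number of folds, and the number of trials are hypothetical placeholders, since the abstract states only the counts of strategies and architectures.

```python
import itertools
import random

# Hypothetical placeholders: the abstract gives only the counts (three
# strategies, five architectures), not labels, layer counts, folds, or trials.
STRATEGIES = ["strategy_1", "strategy_2", "strategy_3"]
CONV_LAYERS = [4, 6, 8, 10, 12]
N_FOLDS = 5
N_TRIALS = 3


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices once and split them into disjoint validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def experiment_grid():
    """Yield one run descriptor per (strategy, architecture, fold, trial)."""
    for strategy, layers in itertools.product(STRATEGIES, CONV_LAYERS):
        for fold in range(N_FOLDS):
            for trial in range(N_TRIALS):
                yield {
                    "strategy": strategy,
                    "conv_layers": layers,
                    "fold": fold,
                    "trial": trial,
                }


runs = list(experiment_grid())
# A "model" here is one (strategy, architecture) pair; folds and trials
# only repeat its training to average out data and initialization effects.
models = {(r["strategy"], r["conv_layers"]) for r in runs}
print(len(models), "models,", len(runs), "training runs")
```

Averaging the accuracy of each (strategy, architecture) pair over its folds and trials, and inspecting the spread, is one way to compare both performance and stability across the 15 models.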