Early detection of patients vulnerable to infections acquired in the hospital
environment is a challenge in current health systems given the impact that such
infections have on patient mortality and healthcare costs. This work is focused
on both the identification of risk factors and the prediction of
healthcare-associated infections in intensive-care units by means of
machine-learning methods. The goal is to support decision making aimed at
reducing the incidence rate of infections. In this field, it is necessary to
deal with the problem of building reliable classifiers from imbalanced
datasets. We propose a clustering-based undersampling strategy to be used in
combination with ensemble classifiers. A comparative study with data from 4616
patients was conducted to validate our proposal. We applied several
single and ensemble classifiers both to the original dataset and to data
preprocessed with different resampling methods. The results were
analyzed using classic and recent metrics specifically designed for
imbalanced data classification. They revealed that the proposal is more
efficient than the other approaches.
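
As a rough illustration of the general idea behind clustering-based undersampling, the sketch below keeps every minority-class sample and thins the majority class cluster by cluster. The abstract does not state which clustering algorithm or selection rule is used, so everything here is an assumption made for illustration: plain k-means, a per-cluster quota proportional to cluster size, random selection within each cluster, binary labels (0 = majority, 1 = minority), and the hypothetical names `kmeans_labels` and `cluster_undersample`. The data is synthetic and unrelated to the study.

```python
import random
from math import dist


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_undersample(X, y, k=3, seed=0):
    """Keep all minority samples (label 1); thin the majority class (label 0)."""
    rng = random.Random(seed)
    X_min = [x for x, t in zip(X, y) if t == 1]
    X_maj = [x for x, t in zip(X, y) if t == 0]
    labels = kmeans_labels(X_maj, k, seed=seed)
    kept = []
    for c in range(k):
        members = [x for x, lab in zip(X_maj, labels) if lab == c]
        if not members:
            continue
        # Assumed rule: quota proportional to cluster size, so the retained
        # majority samples roughly match the minority count overall.
        quota = max(1, round(len(X_min) * len(members) / len(X_maj)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return X_min + kept, [1] * len(X_min) + [0] * len(kept)


# Synthetic, imbalanced toy data: 60 majority points in three blobs, 6 minority.
gen = random.Random(42)
majority = [(gen.gauss(cx, 0.5), gen.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]
minority = [(gen.gauss(2.5, 0.5), gen.gauss(2.5, 0.5)) for _ in range(6)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

X_bal, y_bal = cluster_undersample(X, y, k=3)
print(len(X), "->", len(X_bal), "samples;", y_bal.count(0), "majority kept")
```

The balanced output could then be passed to any ensemble classifier. Sampling within clusters rather than from the whole majority class is meant to preserve the spread of the majority data after reduction.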