A critical look at studies applying over-sampling on the TPEHGDB dataset

A García-Blanco; A Smrdel; AJ Hussain; AL Goldberger; DA Silva De; G Fele-Žorž; H Watson; J Ryu; K Subramaniam; L Liu; LJ Meertens; M Shahrdad; MU Ahmed; N Sadi-Ahmed; NV Chawla; P Fergus; P Ren; S Sim; SM Naeem; UR Acharya

research

Publication date: 1 January 2019
Publisher: Springer Science and Business Media LLC

Abstract

Preterm birth is the leading cause of death among young children and is highly prevalent worldwide. Machine learning models based on features extracted from clinical sources, such as electronic patient files, yield promising results. In this study, we review studies that constructed predictive models on a publicly available dataset, the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals in addition to clinical data. These studies often report near-perfect prediction results after applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied to the entire dataset before it is partitioned into training and test sets. This has two consequences: (i) samples that are highly correlated with data points from the test set are added to the training set, and (ii) artificial samples that are highly correlated with points from the training set are added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling to the training set.
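The ordering problem described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy dataset of (record id, label) pairs, the 90/10 class ratio, the 25% test fraction, and the helper names `oversample`, `stratified_split` and `shared_ids`. The over-sampler simply repeats minority samples, standing in for whatever technique a given study used; interpolating over-samplers would produce near-copies rather than exact copies, but the leakage mechanism is the same.

```python
import random
from itertools import cycle, islice


def oversample(samples):
    """Balance classes by repeating minority samples (label 1) in turn."""
    minority = [s for s in samples if s[1] == 1]
    majority = [s for s in samples if s[1] == 0]
    n_extra = len(majority) - len(minority)
    return samples + list(islice(cycle(minority), n_extra))


def stratified_split(samples, test_fraction, rng):
    """Shuffle each class separately and hold out a fraction of it for testing."""
    train, test = [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_fraction)
        test += group[:n_test]
        train += group[n_test:]
    return train, test


def shared_ids(train, test):
    """Record ids that occur on both sides of the split."""
    return {s[0] for s in train} & {s[0] for s in test}


# Hypothetical toy data: (record_id, label), 90 majority and 10 minority records.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

# Flawed order: over-sample the whole dataset, then partition it.
rng = random.Random(0)
train_bad, test_bad = stratified_split(oversample(data), 0.25, rng)
leak_before = shared_ids(train_bad, test_bad)

# Sound order: partition first, then over-sample the training set only.
rng = random.Random(0)
train, test = stratified_split(data, 0.25, rng)
train_ok = oversample(train)
leak_after = shared_ids(train_ok, test)

print("ids shared when over-sampling before the split:", len(leak_before))
print("ids shared when over-sampling after the split:", len(leak_after))
```

In the flawed order, copies of the same minority record end up on both sides of the split, so a classifier is partly evaluated on records it has already seen. In the sound order, the test set contains only original records that have no counterpart in the training set.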

Available Versions

Crossref (last updated on 10/08/2021)

Ghent University Academic Bibliography, oai:archive.ugent.be:8628812 (last updated on 04/11/2019)