
    Relative performance of machine learning and linear regression in predicting quality of life and academic performance of school children in Norway : data analysis of a quasi-experimental study

Background: Machine learning (ML) approaches are increasingly being used in health research. It is not clear how useful these approaches are for modelling continuous health outcomes. Child quality of life (QoL) is associated with parental socioeconomic status and child activity levels, and may be associated with aerobic fitness and strength. It is not clear whether diet or academic performance (AP) is associated with QoL.

Objective: To compare the predictive performance of ML approaches with that of linear regression for modelling QoL and AP using parental education and lifestyle data.

Methods: We modelled data from children attending nine schools in a quasi-experimental study (NCT02495714). We split data randomly into training and validation sets, and simulated curvilinear, non-linear, and heteroscedastic variables. We examined the relative performance of ML approaches using R², making comparisons to mixed and fixed models, and regression with splines, with and without imputation. We also examined the effect of training set size on overfitting.

Results: We had 1,711 cases. Using real data, our regression models explained 24% of AP variance in the complete-case validation set, and up to 15% of QoL variance. While ML models explained high proportions of variance in training sets, in validation sets they explained ~0% of AP variance and between 3% and 8% of QoL variance. Following imputation, ML models improved up to 15% for AP. ML models outperformed regression for modelling simulated non-linear and heteroscedastic variables only. A smaller training set did not lead to increased overfitting. The best predictors of QoL were 7-point self-reported activity (P<.001; β=1.09 (95% CI 0.53 to 1.66)) and TV/computer use (P=.002; β=-0.95 (-1.55 to -0.36)). For AP, these were mother having master’s-level education (P<.001; β=1.98 (0.25 to 3.71)) and dichotomised self-reported activity (P=.001; β=2.47 (1.08 to 3.87)). Adjusted academic performance was associated with QoL (P=.02; β=0.12 (0.02 to 0.22)).

Conclusions: Exercising to cause sweat once per week and 2 hours per day of TV or computer use are associated with small-to-medium increases and decreases in child QoL, respectively. An increase in AP of 20 units is associated with a small increase in QoL. A mother having higher and master’s-level education, 2 hours per day of TV or computer use, and taking at least 2 hours of exercise are each associated with small-to-medium increases in AP. Differences between the effects of computer/TV use for work and for leisure need further investigation. Linear regression is less prone to overfitting and performs better than ML in predicting continuous health outcomes in a dataset containing missing data. Imputation improves ML performance but not enough to outperform regression. ML outperformed regression with non-linear and heteroscedastic data and may be of use when such relationships exist, and where imputation is sensible or there are no missing data.

Clinical Trial: The data come from a quasi-experimental study, not an RCT; nevertheless, the study from which the data are drawn is registered: NCT0249571
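
The training-versus-validation comparison described in the Methods can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are simulated (a noisy linear relationship), not the study data; a 1-nearest-neighbour predictor stands in for a flexible ML model and is not one of the authors' models; and the function names (`r_squared`, `fit_ols`, `fit_1nn`, `compare`) are hypothetical. The sketch only shows how a flexible model can score a high R² on the training set while doing worse than linear regression on held-out data.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ols(xs, ys):
    """Simple one-predictor least-squares regression; returns a predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_1nn(xs, ys):
    """1-nearest-neighbour predictor (illustrative stand-in for a flexible ML model)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def compare(seed=0, n=600, noise_sd=5.0):
    """Simulate noisy linear data, split it at random, and score both models."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise_sd) for x in xs]

    # Random 50/50 split into training and validation sets (assumed ratio).
    idx = list(range(n))
    rng.shuffle(idx)
    train, val = idx[: n // 2], idx[n // 2:]
    x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
    x_va, y_va = [xs[i] for i in val], [ys[i] for i in val]

    results = {}
    for name, fit in (("ols", fit_ols), ("1nn", fit_1nn)):
        model = fit(x_tr, y_tr)
        r2_train = r_squared(y_tr, [model(x) for x in x_tr])
        r2_val = r_squared(y_va, [model(x) for x in x_va])
        results[name] = (r2_train, r2_val)
    return results


if __name__ == "__main__":
    for name, (r2_train, r2_val) in compare().items():
        print(f"{name}: train R2 = {r2_train:.2f}, validation R2 = {r2_val:.2f}")
```

The nearest-neighbour model reproduces its training targets, so its training R² says little about how it will perform on new cases; the gap between training and validation R² is the quantity to watch for overfitting.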