19 research outputs found
Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach
Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation
Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline Approach
Overlapping asymmetric datasets are common in data science and pose questions
of how they can be incorporated together into a predictive analysis. In
healthcare datasets there is often a small amount of information that is
available for a larger number of patients such as an electronic health record,
however a small number of patients may have had extensive further testing.
Common solutions such as missing imputation can often be unwise if the smaller
cohort is significantly different in scale to the larger sample, therefore the
aim of this research is to develop a new method which can model the smaller
cohort against a particular response, whilst considering the larger cohort
also. Motivated by non-parametric models, and specifically flexible smoothing
techniques via generalized additive models, we model a twice penalized P-Spline
approximation method to firstly prevent over/under-fitting of the smaller
cohort and secondly to consider the larger cohort. This second penalty is
created through discrepancies in the marginal value of covariates that exist in
both the smaller and larger cohorts. Through data simulations, parameter
tunings and model adaptations to consider a continuous and binary response, we
find our twice penalized approach offers an enhanced fit over a linear B-Spline
and once penalized P-Spline approximation. Applying to a real-life dataset
relating to a person's risk of developing Non-Alcoholic Steatohepatitis, we see
an improved model fit performance of over 65%. Areas for future work within
this space include adapting our method to not require dimensionality reduction
and also consider parametric modelling methods. However, to our knowledge this
is the first work to propose additional marginal penalties in a flexible
regression of which we can report a vastly improved model fit that is able to
consider asymmetric datasets, without the need for missing data imputation.Comment: 52 pages, 17 figures, 8 tables, 34 reference
Machine learning approaches to enhance diagnosis and staging of patients with MASLD using routinely available clinical information
Aims: Metabolic dysfunction Associated Steatotic Liver Disease (MASLD) outcomes such as MASH (metabolic dysfunction associated steatohepatitis), fibrosis and cirrhosis are ordinarily determined by resource-intensive and invasive biopsies. We aim to show that routine clinical tests offer sufficient information to predict these endpoints. Methods: Using the LITMUS Metacohort derived from the European NAFLD Registry, the largest MASLD dataset in Europe, we create three combinations of features which vary in degree of procurement including a 19-variable feature set that are attained through a routine clinical appointment or blood test. This data was used to train predictive models using supervised machine learning (ML) algorithm XGBoost, alongside missing imputation technique MICE and class balancing algorithm SMOTE. Shapley Additive exPlanations (SHAP) were added to determine relative importance for each clinical variable. Results: Analysing nine biopsy-derived MASLD outcomes of cohort size ranging between 5385 and 6673 subjects, we were able to predict individuals at training set AUCs ranging from 0.719-0.994, including classifying individuals who are At-Risk MASH at an AUC = 0.899. Using two further feature combinations of 26-variables and 35-variables, which included composite scores known to be good indicators for MASLD endpoints and advanced specialist tests, we found predictive performance did not sufficiently improve. We are also able to present local and global explanations for each ML model, offering clinicians interpretability without the expense of worsening predictive performance. Conclusions: This study developed a series of ML models of accuracy ranging from 71.9—99.4% using only easily extractable and readily available information in predicting MASLD outcomes which are usually determined through highly invasive means
Performance of non-invasive tests and histology for the prediction of clinical outcomes in patients with non-alcoholic fatty liver disease: an individual participant data meta-analysis
BackgroundHistologically assessed liver fibrosis stage has prognostic significance in patients with non-alcoholic fatty liver disease (NAFLD) and is accepted as a surrogate endpoint in clinical trials for non-cirrhotic NAFLD. Our aim was to compare the prognostic performance of non-invasive tests with liver histology in patients with NAFLD.MethodsThis was an individual participant data meta-analysis of the prognostic performance of histologically assessed fibrosis stage (F0–4), liver stiffness measured by vibration-controlled transient elastography (LSM-VCTE), fibrosis-4 index (FIB-4), and NAFLD fibrosis score (NFS) in patients with NAFLD. The literature was searched for a previously published systematic review on the diagnostic accuracy of imaging and simple non-invasive tests and updated to Jan 12, 2022 for this study. Studies were identified through PubMed/MEDLINE, EMBASE, and CENTRAL, and authors were contacted for individual participant data, including outcome data, with a minimum of 12 months of follow-up. The primary outcome was a composite endpoint of all-cause mortality, hepatocellular carcinoma, liver transplantation, or cirrhosis complications (ie, ascites, variceal bleeding, hepatic encephalopathy, or progression to a MELD score ≥15). We calculated aggregated survival curves for trichotomised groups and compared them using stratified log-rank tests (histology: F0–2 vs F3 vs F4; LSM: 2·67; NFS: 0·676), calculated areas under the time-dependent receiver operating characteristic curves (tAUC), and performed Cox proportional-hazards regression to adjust for confounding. This study was registered with PROSPERO, CRD42022312226.FindingsOf 65 eligible studies, we included data on 2518 patients with biopsy-proven NAFLD from 25 studies (1126 [44·7%] were female, median age was 54 years [IQR 44–63), and 1161 [46·1%] had type 2 diabetes). After a median follow-up of 57 months [IQR 33–91], the composite endpoint was observed in 145 (5·8%) patients. Stratified log-rank tests showed significant differences between the trichotomised patient groups (p<0·0001 for all comparisons). The tAUC at 5 years were 0·72 (95% CI 0·62–0·81) for histology, 0·76 (0·70–0·83) for LSM-VCTE, 0·74 (0·64–0·82) for FIB-4, and 0·70 (0·63–0·80) for NFS. All index tests were significant predictors of the primary outcome after adjustment for confounders in the Cox regression.InterpretationSimple non-invasive tests performed as well as histologically assessed fibrosis in predicting clinical outcomes in patients with NAFLD and could be considered as alternatives to liver biopsy in some cases
Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach
Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation
Machine learning approaches to enhance diagnosis and staging of patients with MASLD using routinely available clinical information
Abstract: Aims Metabolic dysfunction Associated Steatotic Liver Disease (MASLD) outcomes such as MASH (metabolic dysfunction associated steatohepatitis), fibrosis and cirrhosis are ordinarily determined by resource-intensive and invasive biopsies. We aim to show that routine clinical tests offer sufficient information to predict these endpoints.Methods Using the LITMUS Metacohort derived from the European NAFLD Registry, the largest MASLD dataset in Europe, we create three combinations of features which vary in degree of procurement including a 19-variable feature set that are attained through a routine clinical appointment or blood test. This data was used to train predictive models using supervised machine learning (ML) algorithm XGBoost, alongside missing imputation technique MICE and class balancing algorithm SMOTE. Shapley Additive exPlanations (SHAP) were added to determine relative importance for each clinical variable.Results Analysing nine biopsy-derived MASLD outcomes of cohort size ranging between 5385 and 6673 subjects, we were able to predict individuals at training set AUCs ranging from 0.719-0.994, including classifying individuals who are At-Risk MASH at an AUC = 0.899. Using two further feature combinations of 26-variables and 35-variables, which included composite scores known to be good indicators for MASLD endpoints and advanced specialist tests, we found predictive performance did not sufficiently improve. We are also able to present local and global explanations for each ML model, offering clinicians interpretability without the expense of worsening predictive performance.Conclusions This study developed a series of ML models of accuracy ranging from 71.9-99.4% using only easily extractable and readily available information in predicting MASLD outcomes which are usually determined through highly invasive means
Recommended from our members
Machine learning approaches to enhance diagnosis and staging of patients with MASLD using routinely available clinical information.
AIMS: Metabolic dysfunction Associated Steatotic Liver Disease (MASLD) outcomes such as MASH (metabolic dysfunction associated steatohepatitis), fibrosis and cirrhosis are ordinarily determined by resource-intensive and invasive biopsies. We aim to show that routine clinical tests offer sufficient information to predict these endpoints. METHODS: Using the LITMUS Metacohort derived from the European NAFLD Registry, the largest MASLD dataset in Europe, we create three combinations of features which vary in degree of procurement including a 19-variable feature set that are attained through a routine clinical appointment or blood test. This data was used to train predictive models using supervised machine learning (ML) algorithm XGBoost, alongside missing imputation technique MICE and class balancing algorithm SMOTE. Shapley Additive exPlanations (SHAP) were added to determine relative importance for each clinical variable. RESULTS: Analysing nine biopsy-derived MASLD outcomes of cohort size ranging between 5385 and 6673 subjects, we were able to predict individuals at training set AUCs ranging from 0.719-0.994, including classifying individuals who are At-Risk MASH at an AUC = 0.899. Using two further feature combinations of 26-variables and 35-variables, which included composite scores known to be good indicators for MASLD endpoints and advanced specialist tests, we found predictive performance did not sufficiently improve. We are also able to present local and global explanations for each ML model, offering clinicians interpretability without the expense of worsening predictive performance. CONCLUSIONS: This study developed a series of ML models of accuracy ranging from 71.9-99.4% using only easily extractable and readily available information in predicting MASLD outcomes which are usually determined through highly invasive means
Evaluation metrics for <i>’Core’</i> dataset performance upon predicting all response using XGBoost with MICE and SMOTE.
Evaluation metrics for ’Core’ dataset performance upon predicting all response using XGBoost with MICE and SMOTE.</p
SHAP force plots.
Force plots illustrating the impact of each feature upon the prediction of 4 random individual’s probability of At-Risk MASH. Top Left: A non-diabetic, 49 year old man of low fibrosis stage. Top Right: A diabetic, 69 year old woman of low fibrosis stage. Bottom Left: A non-diabetic 76 year old woman of high fibrosis stage. Bottom Right: A diabetic, 55 year old man of high fibrosis stage.</p