45 research outputs found

    Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values

    Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both work-arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness-induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive the adaptive predictor-set linear model (aps-lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is an LR generalization that inherently handles missing values. It is applied to a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof-of-principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.
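
    The two-stage idea in this abstract can be illustrated with a minimal sketch. It is not the authors' aps-lm derivation (the abstract names a predictor-selection operation and a reduced QR decomposition, which are not reproduced here). It assumes instead that the shared parameters are simply the mean vector and covariance matrix of the complete reference data, and that each incomplete record is predicted from the sub-model on its observed predictors. The function names `fit_reference` and `predict_row` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def fit_reference(X, y):
    """Stage 1: summarise a complete reference data set by its moments only.

    Assumption: the mean vector and covariance matrix stand in for the
    'privacy-preserving parameters'; no individual records are kept.
    """
    Z = np.column_stack([X, y])          # outcome stored as the last column
    return Z.mean(axis=0), np.cov(Z, rowvar=False)


def predict_row(x, mean, cov):
    """Stage 2: predict y for one record; NaN entries in x count as missing.

    Returns the prediction and the residual variance of the sub-model
    that uses only the observed predictors.
    """
    p = len(x)
    obs = np.flatnonzero(~np.isnan(x))   # indices of observed predictors
    mu_y, var_y = mean[p], cov[p, p]
    if obs.size == 0:                    # nothing observed: fall back to the mean
        return mu_y, var_y
    S_xx = cov[np.ix_(obs, obs)]         # covariance among observed predictors
    s_xy = cov[obs, p]                   # covariance of observed predictors with y
    beta = np.linalg.pinv(S_xx) @ s_xy   # sub-model slopes via the pseudoinverse
    y_hat = mu_y + (x[obs] - mean[obs]) @ beta
    resid_var = var_y - s_xy @ beta      # grows as more predictors go missing
    return y_hat, resid_var


# Illustrative use on simulated data (all values are made up for the demo).
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(200, 3))
y_ref = 1.0 + X_ref @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
mean, cov = fit_reference(X_ref, y_ref)

complete = np.array([0.3, -0.2, 1.0])
incomplete = np.array([0.3, np.nan, 1.0])
print(predict_row(complete, mean, cov))
print(predict_row(incomplete, mean, cov))
```

    With all predictors observed, this sketch reproduces the ordinary least-squares prediction; with missing entries, the reported residual variance can only stay the same or increase, which mirrors the abstract's point that prediction error should reflect the pattern of missing values.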


    Investigating the epigenetic discrimination of identical twins using buccal swabs, saliva, and cigarette butts in the forensic setting

    Monozygotic (MZ) twins are typically indistinguishable via forensic DNA profiling. Recently, we demonstrated that epigenetic differentiation of MZ twins is feasible; however, proportions of twin differentially methylated CpG sites (tDMSs) identified in reference-type blood DNA were not replicated in trace-type blood DNA. Here we investigated buccal swabs as typical forensic reference material, and saliva and cigarette butts as commonly encountered forensic trace materials. As an analog to a forensic case, we analyzed one MZ twin pair. Epigenome-wide microarray analysis in reference-type buccal DNA revealed 25 candidate tDMSs with >0.5 twin-to-twin differences. MethyLight quantitative PCR (qPCR) of 22 selected tDMSs in trace-type DNA revealed in saliva DNA that six tDMSs (27.3%) had >0.1 twin-to-twin differences, seven (31.8%) had smaller (<0.1) but robustly detected differences, whereas for nine (40.9%) the differences were in the opposite direction relative to the microarray data; for cigarette butt DNA, results were 50%, 22.7%, and 27.3%, respectively. The discrepancies between reference-type and trace-type DNA outcomes can be explained by cell composition differences, method-to-method variation, and other technical reasons including bisulfite conversion inefficiency. Our study highlights the importance of the DNA source and shows that careful characterization of biological and technical effects is needed before epigenetic MZ twin differentiation is applicable in forensic casework.

    Validated inference of smoking habits from blood with a finite DNA methylation marker set

    Inferring a person’s smoking habit and history from blood is relevant for complementing or replacing self-reports in epidemiological and public health research, and for forensic applications. However, a finite DNA methylation marker set and a validated statistical model based on a large dataset are not yet available. Employing 14 epigenome-wide association studies for marker discovery, and using data from six population-based cohorts (N = 3764) for model building, we identified 13 CpGs most suitable for inferring smoking versus non-smoking status from blood with a cumulative Area Under the Curve (AUC) of 0.901. Internal fivefold cross-validation yielded an average AUC of 0.897 ± 0.137, while external model validation in an independent population-based cohort (

    Novel Uses of Epigenetics in Forensic Science


    From forensic epigenetics to forensic epigenomics: Broadening DNA investigative intelligence

    Human genetic variation is a major resource in forensics, but does not allow all forensically relevant questions to be answered. Some questions may instead be addressable via epigeno