    The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

    Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
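
The filter approach this abstract favours can be illustrated with a minimal sketch: score each feature by the absolute value of a pooled-variance (Student's) two-sample t-statistic and keep the top k. The function names, the toy data, and the choice of k are assumptions made for illustration; this is not the authors' published code.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t-statistic with pooled variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


def top_k_features(X, y, k):
    """Return indices of the k features with the largest |t| between classes 0 and 1."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(student_t(group0, group1)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy expression matrix: 6 samples x 3 genes.
# Only gene 0 differs clearly between the two outcome classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.5],
    [3.0, 5.0, 0.4],
    [3.2, 5.2, 0.8],
    [2.9, 4.8, 0.6],
]
y = [0, 0, 0, 1, 1, 1]

selected = top_k_features(X, y, k=1)
print(selected)
```

In a real evaluation the ranking would have to be recomputed inside each cross-validation fold, since selecting features on the full dataset leaks label information into the test estimate.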

    Ten years of image analysis and machine learning competitions in dementia

    Machine learning methods exploiting multi-parametric biomarkers, especially based on neuroimaging, have huge potential to improve early diagnosis of dementia and to predict which individuals are at risk of developing dementia. To benchmark algorithms in the field of machine learning and neuroimaging in dementia and assess their potential for use in clinical practice and clinical trials, seven grand challenges have been organized in the last decade. The seven grand challenges addressed questions related to screening, clinical status estimation, prediction and monitoring in (pre-clinical) dementia. There was little overlap in clinical questions, tasks and performance metrics. While this helps provide insight into a broad range of questions, it also limits the validation of results across challenges. The validation process itself was mostly comparable between challenges, using similar methods for ensuring objective comparison, uncertainty estimation and statistical testing. In general, winning algorithms performed rigorous data preprocessing and combined a wide range of input features. Despite high state-of-the-art performances, most of the methods evaluated by the challenges are not clinically used. To increase impact, future challenges could pay more attention to statistical analysis of which factors relate to higher performance, to clinical questions beyond Alzheimer's disease, and to using testing data beyond the Alzheimer's Disease Neuroimaging Initiative. Grand challenges would be an ideal venue for assessing the generalizability of algorithm performance to unseen data of other cohorts. Key to increasing impact in this way are larger testing data sizes, which could be reached by sharing algorithms rather than data to exploit data that cannot be shared. Comment: 12 pages, 4 tables

    Using network analysis for the prediction of treatment dropout in patients with mood and anxiety disorders: a methodological proof-of-concept study

    There are large health, societal, and economic costs associated with attrition from psychological services. The recently emerged, innovative statistical tool of complex network analysis was used in the present proof-of-concept study to improve the prediction of attrition. Fifty-eight patients undergoing psychological treatment for mood or anxiety disorders were assessed using Ecological Momentary Assessments four times a day for two weeks before treatment (3,248 measurements). Multilevel vector autoregressive models were employed to compute dynamic symptom networks. Intake variables and network parameters (centrality measures) were used as predictors for dropout using machine-learning algorithms. Networks for patients differed significantly between completers and dropouts. Among intake variables, initial impairment and sex predicted dropout, explaining 6% of the variance. The network analysis identified four additional predictors: expected force of being excited, outstrength of experiencing social support, betweenness of feeling nervous, and instrength of being active. The final model with the two intake and four network variables explained 32% of variance in dropout and identified 47 out of 58 patients correctly. The findings indicate that patients’ dynamic network structures may improve the prediction of dropout. When implemented in routine care, such prediction models could identify patients at risk for attrition and inform personalized treatment recommendations. This work was supported by the German Research Foundation National Institute (DFG, Grant nos. LU 660/8-1 and LU 660/10-1 to W. Lutz). The funder of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the manuscript. The corresponding author had access to all data in the study and had final responsibility for the decision to submit for publication. Dr. Hofmann receives financial support from the Alexander von Humboldt Foundation (as part of the Humboldt Prize), NIH/NCCIH (R01AT007257), NIH/NIMH (R01MH099021, U01MH108168), and the James S. McDonnell Foundation 21st Century Science Initiative in Understanding Human Cognition - Special Initiative. Accepted manuscript

    The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up

    We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guessing. Two ensemble methods, based on taking the mean and the median over all predictions, obtained top scores on almost all tasks. Better-than-average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as patient-specific biomarker trends. The submission system remains open via the website https://tadpole.grand-challenge.org, while code for submissions is being collated by TADPOLE SHARE: https://tadpole-share.github.io/. Our work suggests that current prediction algorithms are accurate for biomarkers related to clinical diagnosis and ventricle volume, opening up the possibility of cohort refinement in clinical trials for Alzheimer's disease.
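
The two ensembles mentioned in this abstract combine submissions by taking the per-subject mean or median across algorithms. A minimal sketch of that idea follows; the forecast values, units, and function name are hypothetical, and the challenge's actual submission format is not reproduced here.

```python
from statistics import mean, median


def ensemble(predictions, combine):
    """Combine per-subject forecasts from several algorithms.

    `predictions` holds one list per algorithm, each with one value per subject;
    `combine` is applied across algorithms for every subject.
    """
    return [combine(values) for values in zip(*predictions)]


# Hypothetical ventricle-volume forecasts (arbitrary units):
# three algorithms, four subjects. The second algorithm has an outlier
# for the last subject.
forecasts = [
    [2.0, 3.0, 4.0, 5.0],
    [2.5, 3.5, 3.0, 9.0],
    [1.5, 2.5, 5.0, 5.5],
]

mean_forecast = ensemble(forecasts, mean)
median_forecast = ensemble(forecasts, median)
print(mean_forecast)
print(median_forecast)
```

The last subject shows the practical difference between the two: the mean is pulled toward the outlying forecast, while the median ignores it.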

    Psoriasis prediction from genome-wide SNP profiles

    Background: With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. This study aimed to use single nucleotide polymorphisms (SNPs) selected from GWAS data to predict psoriasis. Methods: In total, the data comprised 2,798 samples and 451,724 SNPs. The search for a set of SNPs to predict susceptibility to psoriasis consisted of two steps. The first was to find the top 1,000 SNPs with high accuracy for predicting psoriasis in the GWAS dataset. The second was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance. Results: The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674 (95% CI: 0.650-0.698), while only 0.520 (95% CI: 0.472-0.524) was reported for predicting disease by LDA. Our results indicate that the new classifier sIB performs better than LDA in the study. Conclusions: The fact that a small set of SNPs can predict disease status with an average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
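
The performance measure reported in this abstract, the harmonic mean of sensitivity and specificity, can be computed from a confusion matrix as in the sketch below. The counts are hypothetical and are not taken from the study; the function name is an assumption.

```python
def harmonic_mean_sens_spec(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Hypothetical counts: 100 cases and 100 controls.
score = harmonic_mean_sens_spec(tp=60, fn=40, tn=80, fp=20)
print(round(score, 3))
```

Unlike plain accuracy, this measure stays low when a classifier does well on one class only, which matters when cases and controls are unbalanced.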

    Machine Learning Framework to Identify Individuals at Risk of Rapid Progression of Coronary Atherosclerosis : From the PARADIGM Registry

    Background: Rapid coronary plaque progression (RPP) is associated with incident cardiovascular events. To date, no method exists for the identification of individuals at risk of RPP at a single point in time. This study integrated coronary computed tomography angiography-determined qualitative and quantitative plaque features within a machine learning (ML) framework to determine its performance for predicting RPP. Methods and Results: Qualitative and quantitative coronary computed tomography angiography plaque characterization was performed in 1083 patients who underwent serial coronary computed tomography angiography from the PARADIGM (Progression of Atherosclerotic Plaque Determined by Computed Tomographic Angiography Imaging) registry. RPP was defined as an annual progression of percentage atheroma volume ≥1.0%. We employed the following ML models: model 1, clinical variables; model 2, model 1 plus qualitative plaque features; model 3, model 2 plus quantitative plaque features. ML models were compared with the atherosclerotic cardiovascular disease risk score, Duke coronary artery disease score, and a logistic regression statistical model. 224 patients (21%) were identified as RPP. Feature selection in ML identified quantitative computed tomography variables as the higher-ranking features, followed by qualitative computed tomography variables and clinical/laboratory variables. ML model 3 exhibited the highest discriminatory performance to identify individuals who would experience RPP when compared with atherosclerotic cardiovascular disease risk score, the other ML models, and the statistical model (area under the receiver operating characteristic curve in ML model 3, 0.83 [95% CI 0.78-0.89], versus atherosclerotic cardiovascular disease risk score, 0.60 [0.52-0.67]; Duke coronary artery disease score, 0.74 [0.68-0.79]; ML model 1, 0.62 [0.55-0.69]; ML model 2, 0.73 [0.67-0.80]; all P<0.001; statistical model, 0.81 [0.75-0.87], P=0.128). Conclusions: Based on a ML framework, quantitative atherosclerosis characterization has been shown to be the most important feature when compared with clinical, laboratory, and qualitative measures in identifying patients at risk of RPP.
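
The models in this abstract are compared by area under the receiver operating characteristic curve (AUC). As a rough illustration of what that number measures, the sketch below computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, counting ties as one half. The labels and scores are hypothetical, and the function name is an assumption.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores: 1 = rapid progression, 0 = no rapid progression.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.4]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.60 to 0.83 should be read.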