64 research outputs found

    A Validated Clinical Risk Prediction Model for Lung Cancer in Smokers of All Ages and Exposure Types: A HUNT Study

    Lung cancer causes >1·6 million deaths annually, with early diagnosis being paramount to effective treatment. Here we present a validated risk assessment model for lung cancer screening. The prospective HUNT2 population study in Norway examined 65,237 people aged >20 years in 1995–97. After a median of 15·2 years, 583 lung cancer cases had been diagnosed: 552 (94·7%) in ever-smokers and 31 (5·3%) in never-smokers. We performed multivariable analyses of 36 candidate risk predictors, using multiple imputation of missing data and backwards feature selection with Cox regression. The resulting model was validated in an independent Norwegian prospective dataset of 45,341 ever-smokers, in which 675 lung cancers had been diagnosed after a median follow-up of 11·6 years. Our final HUNT Lung Cancer Model included age, pack-years, smoking intensity, years since smoking cessation, body mass index, daily cough, and hours of daily indoor exposure to smoke. External validation showed a concordance index of 0·879 (95% CI 0·866–0·891) and an area under the curve of 0·87 (95% CI 0·85–0·89) within 6 years. Only 22% of ever-smokers would need screening to identify 81·85% of all lung cancers within 6 years. Our model of seven variables is simple, accurate, and useful for screening selection. Keywords: Early diagnosis, Lung cancer prediction, Ever-smokers, All smokers, All ages, Data-driven, Feature selection, External validation
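    The modelling recipe summarized above (a multivariable Cox proportional-hazards model whose discrimination is reported as a concordance index) can be sketched compactly. The snippet below is a minimal, self-contained illustration on synthetic data using the lifelines library; the variable names (pack_years, daily_cough), the 15-year censoring horizon, and all coefficients are stand-ins chosen for illustration, not the authors' code or data.

```python
# Minimal sketch of a Cox proportional-hazards risk model evaluated with a
# concordance index, in the spirit of the HUNT Lung Cancer Model described
# above. Synthetic data and arbitrary effect sizes; not the authors' code.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.uniform(20, 80, n),
    "pack_years": rng.exponential(15, n),
    "bmi": rng.normal(26, 4, n),
    "daily_cough": rng.integers(0, 2, n),
})
# Synthetic follow-up: hazard loosely increasing with age, pack-years, cough.
risk = 0.03 * df["age"] + 0.02 * df["pack_years"] + 0.3 * df["daily_cough"]
df["time"] = rng.exponential(np.exp(6 - 0.8 * risk))
df["event"] = (df["time"] < 15).astype(int)   # diagnosed within follow-up
df["time"] = df["time"].clip(upper=15)        # administrative censoring at 15 y

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

# Harrell's concordance index; a real study would compute this on an
# independent validation cohort rather than the training data.
c_index = concordance_index(df["time"], -cph.predict_partial_hazard(df), df["event"])
print(f"c-index: {c_index:.3f}")
```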

    Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets

    The COVID-19 outbreak has placed intense pressure on healthcare systems, creating an urgent demand for effective diagnostic, prognostic and therapeutic procedures. Here, we employed Automated Machine Learning (AutoML) to analyze three publicly available high-throughput COVID-19 datasets, including proteomic, metabolomic and transcriptomic measurements. Pathway analysis of the selected features was also performed. Analysis of a combined proteomic and metabolomic dataset led to 10 equivalent signatures of two features each, with AUC 0.840 (CI 0.723–0.941) in discriminating severe from non-severe COVID-19 patients. A transcriptomic dataset led to two equivalent signatures of eight features each, with AUC 0.914 (CI 0.865–0.955) in identifying COVID-19 patients among those with a different acute respiratory illness. Another transcriptomic dataset led to two equivalent signatures of nine features each, with AUC 0.967 (CI 0.899–0.996) in distinguishing COVID-19 patients from virus-free individuals. Signature predictive performance remained high upon validation. Multiple new features emerged, and pathway analysis implicated the Viral mRNA Translation, Interferon gamma signaling and Innate Immune System pathways, supporting biological relevance. In conclusion, AutoML analysis led to multiple biosignatures of high predictive performance, with few features and a large choice of alternative predictors. These favorable characteristics make the signatures well suited to the development of cost-effective assays that can contribute to better disease management.
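    As a rough illustration of what an automated model-selection loop does, the sketch below searches over feature-selection sizes and classifier configurations and reports the cross-validated AUC of the best pipeline on synthetic "omics-like" data. It is a toy stand-in built with scikit-learn, not the AutoML platform used in the study; a real run would also correct for the optimism introduced by picking the best configuration (for example with nested cross-validation).

```python
# Toy stand-in for an AutoML workflow: search over feature-selection and
# classifier configurations and report the cross-validated AUC of the best
# pipeline. Illustrative only; not the AutoML platform used in the study.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic "omics" matrix: many features, few informative.
X, y = make_classification(n_samples=120, n_features=500, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = [
    {"select__k": [2, 5, 10], "clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1.0, 10.0]},
    {"select__k": [2, 5, 10], "clf": [SVC(probability=True)],
     "clf__C": [0.1, 1.0, 10.0]},
]
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)
print("best cross-validated AUC:", round(search.best_score_, 3))
print("best configuration:", search.best_params_)
```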

    A novel similarity-measure for the analysis of genetic data in complex phenotypes

    BACKGROUND: Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for effectively processing these data has not kept pace. In particular, the Machine Learning literature contains relatively few papers focused on the development and application of data mining methods for the analysis of genetic variability, and these papers apply to genetic data procedures that were developed for other kinds of analysis and do not take into account the peculiarities of population genetics. The aim of our study was to define a new similarity measure, specifically conceived for measuring the similarity between the genetic profiles of two groups of subjects (i.e., cases and controls), taking into account that genetic profiles are usually distributed in a population according to the Hardy-Weinberg equilibrium. RESULTS: We set up a new kernel function consisting of a similarity measure between groups of subjects genotyped at numerous genetic loci. This measure weighs different genetic profiles according to the estimates of gene frequencies at Hardy-Weinberg equilibrium in the population. We named this function the "Hardy-Weinberg kernel". The effectiveness of the Hardy-Weinberg kernel was compared to the performance of the well-established linear kernel. We found that the Hardy-Weinberg kernel significantly outperformed the linear kernel in a number of experiments using either simulated or real data. CONCLUSION: The "Hardy-Weinberg kernel" reported here represents one of the first attempts at incorporating genetic knowledge into the definition of a kernel function designed for the analysis of genetic data. We show that the best performance of the "Hardy-Weinberg kernel" is observed when rare genotypes have different frequencies in cases and controls. The ability to capture the effect of rare genotypes on phenotypic traits may be an important and useful feature, as most current statistical tools lose much of their statistical power when rare genotypes are involved in susceptibility to the trait under study.
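    The abstract does not spell out the exact form of the Hardy-Weinberg kernel, so the sketch below only illustrates the underlying idea under an assumed weighting scheme: matching genotypes contribute to the similarity in inverse proportion to their expected Hardy-Weinberg frequency, so shared rare genotypes dominate the score. It compares two genotype vectors rather than two groups of subjects, and every function name and weight choice here is illustrative, not the paper's definition.

```python
# Illustrative sketch of a Hardy-Weinberg-weighted genotype similarity:
# shared genotypes count for more when they are rare under HWE expectations.
# Simplified per-subject variant for illustration only; not the exact kernel
# defined in the paper (which compares groups of subjects).
import numpy as np

def hwe_expected_freqs(p):
    """Expected HWE frequencies of carrying 0, 1 or 2 copies of an allele with frequency p."""
    return np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])

def hw_similarity(g1, g2, allele_freqs):
    """Similarity of two genotype vectors coded as 0/1/2 minor-allele counts.

    Matching genotypes are up-weighted by the inverse of their expected
    HWE frequency, so rare shared genotypes dominate the score.
    """
    score = 0.0
    for locus, (a, b) in enumerate(zip(g1, g2)):
        if a == b:
            expected = hwe_expected_freqs(allele_freqs[locus])[a]
            score += 1.0 / max(expected, 1e-6)
    return score

rng = np.random.default_rng(1)
n_loci = 50
allele_freqs = rng.uniform(0.05, 0.5, n_loci)
# Two subjects: genotypes drawn locus-by-locus from HWE proportions.
subjects = [
    np.array([rng.choice(3, p=hwe_expected_freqs(p)) for p in allele_freqs])
    for _ in range(2)
]
print("HW-weighted similarity:", round(hw_similarity(*subjects, allele_freqs), 2))
```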

    Improving Lung Cancer Screening Selection:The HUNT Lung Cancer Risk Model for Ever-Smokers Versus the NELSON and 2021 United States Preventive Services Task Force Criteria in the Cohort of Norway: A Population-Based Prospective Study

    Background: Improving the method for selecting participants for lung cancer (LC) screening is an urgent need. Here, we compared the performance of the Helseundersøkelsen i Nord-Trøndelag (HUNT) Lung Cancer Model (HUNT LCM) versus the Dutch-Belgian lung cancer screening trial (Nederlands-Leuvens Longkanker Screenings Onderzoek, NELSON) and 2021 United States Preventive Services Task Force (USPSTF) criteria regarding LC risk prediction and efficiency. Methods: We used linked data from 10 Norwegian prospective population-based cohorts, Cohort of Norway. The study included 44,831 ever-smokers, of whom 686 (1.5%) developed LC; the median follow-up time was 11.6 years (0.01–20.8 years). Results: Within 6 years, 222 (0.5%) individuals developed LC. The NELSON and 2021 USPSTF criteria predicted 37.4% and 59.5% of the LC cases, respectively. When selecting the same number of individuals as the NELSON and 2021 USPSTF criteria, the HUNT LCM increased the LC prediction rate by 41.0% and 12.1%, respectively. The HUNT LCM significantly increased sensitivity (p < 0.001 and p = 0.028) and reduced the number needed to predict one LC case (29 versus 40, p < 0.001, and 36 versus 40, p = 0.02), respectively. Applying a HUNT LCM 6-year risk score of 0.98% as the cutoff (selecting 14.0% of ever-smokers) predicted 70.7% of all LC cases, increasing the LC prediction rate by 89.2% and 18.9% versus the NELSON and 2021 USPSTF criteria, respectively (both p < 0.001). Conclusions: The HUNT LCM was significantly more efficient than the NELSON and 2021 USPSTF criteria, improving the prediction of LC diagnosis, and may be used as a validated clinical tool for screening selection.
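    The efficiency metrics compared above (proportion of cancers captured at a matched selection size, and the number needed to predict one case) can be computed from a per-person risk score and a binary eligibility flag, as in the hedged sketch below. The data are synthetic, and the 20% eligibility rate and score distribution are arbitrary assumptions, not the study's cohort or criteria.

```python
# Sketch of the comparison logic: given a per-person 6-year risk score and a
# binary eligibility flag from fixed criteria, compare the cancers captured
# when the risk model selects the same number of people, plus the "number
# needed to predict" (NNP) one case. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
n = 44831
risk = rng.beta(1, 60, n)                      # toy 6-year risk scores
cancer = rng.random(n) < risk                  # synthetic outcomes
criteria = rng.random(n) < 0.20                # toy fixed-criteria eligibility

n_selected = criteria.sum()
top_by_risk = np.argsort(risk)[::-1][:n_selected]   # same budget as the criteria
model_mask = np.zeros(n, dtype=bool)
model_mask[top_by_risk] = True

def summary(selected_mask):
    captured = cancer[selected_mask].sum()
    nnp = selected_mask.sum() / max(captured, 1)    # people screened per case found
    return captured / cancer.sum(), nnp

for name, mask in [("fixed criteria", criteria), ("risk model", model_mask)]:
    frac, nnp = summary(mask)
    print(f"{name}: captures {frac:.1%} of cancers, NNP = {nnp:.1f}")
```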

    Src and Memory: A Study of Filial Imprinting and Predispositions in the Domestic Chick.

    Visual imprinting is a learning process whereby young animals come to prefer a visual stimulus after exposure to it (training). The available evidence indicates that the intermediate medial mesopallium (IMM) in the domestic chick forebrain is a site of memory formation during visual imprinting. We have studied the role of Src, an important non-receptor tyrosine kinase, in memory formation. Amounts of total Src (Total-Src) and its two phosphorylated forms, tyrosine-416 (activated, 416P-Src) and tyrosine-527 (inhibited, 527P-Src), were measured 1 and 24 h after training in the IMM and in a control brain region, the posterior pole of nidopallium (PPN). One hour after training, in the left IMM, we observed a positive correlation between the amount of 527P-Src and learning strength that was attributable to learning, as well as a positive correlation between 416P-Src and learning strength that was attributable to a predisposition to learn readily. Twenty-four hours after training, the amount of Total-Src increased with learning strength in both the left and right IMM, and the amount of 527P-Src increased with learning strength only in the left IMM; both correlations were attributable to learning. A further, negative, correlation between learning strength and 416P-Src/Total-Src in the left IMM reflected a predisposition to learn. No learning-related changes were found in the PPN control region. We suggest that there are two pools of Src: one in an active state, reflecting a predisposition to learn, and one in an inhibited state, which increases as a result of learning. These two pools may represent two or more signaling pathways: one downstream of Src, activated by tyrosine-416 phosphorylation, and another upstream of Src, keeping the enzyme in an inactivated state via phosphorylation of tyrosine-527.

    Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets

    The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most currently available feature-selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect, SES subsumes and extends previous feature-selection algorithms, such as the max-min parents and children algorithm. SES is implemented in a function of the same name included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data-analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of using the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature-selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real-world data.
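    As a conceptual, Python-only stand-in for the equivalence idea (the reference implementation is the SES function in the R package MXM), the sketch below selects one feature subset and then flags single-feature substitutions whose cross-validated AUC is not significantly different from the reference. The real SES decides equivalence with conditional-independence tests during selection itself, so this illustrates the notion of statistically equivalent signatures, not the MXM algorithm.

```python
# Simplified illustration of "statistically equivalent signatures": pick a
# reference feature subset, then look for single-feature swaps whose
# cross-validated AUC is not significantly different from the reference.
# Conceptual stand-in only; the real SES uses conditional-independence tests.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           n_redundant=5, random_state=0)

reference = SelectKBest(f_classif, k=5).fit(X, y).get_support(indices=True)

def cv_auc(cols):
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y,
                           scoring="roc_auc", cv=10)

base_scores = cv_auc(reference)
equivalent = []
for swap_in in set(range(X.shape[1])) - set(reference):
    candidate = np.append(reference[1:], swap_in)     # replace the first feature
    scores = cv_auc(candidate)
    # paired t-test across CV folds: "equivalent" if no significant difference
    if ttest_rel(base_scores, scores).pvalue > 0.05:
        equivalent.append(swap_in)

print("reference subset:", reference.tolist())
print("features that can replace feature", reference[0], ":", sorted(equivalent))
```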

    IHCV: Discovery of Hidden Time-Dependent Control Variables in Non-Linear Dynamical Systems

    Discovering non-linear dynamical models from data is at the core of science. Recent progress hinges upon sparse regression of observables using extensive libraries of candidate functions. However, it remains challenging to model hidden, non-observable control variables governing switching between different dynamical regimes. Here we develop a data-efficient, derivative-free method, IHCV, for the Identification of Hidden Control Variables. First, the performance and robustness of IHCV against noise are evaluated by benchmarking it on well-known bifurcation models (saddle-node, transcritical, pitchfork, Hopf). Next, we demonstrate that IHCV discovers hidden driver variables in the Lorenz, van der Pol, Hodgkin-Huxley, and FitzHugh-Nagumo models. Finally, IHCV generalizes to the case when only partial observations are given, as demonstrated using the toggle switch model, the genetic repressilator oscillator, and a Waddington landscape model. Our proof of principle illustrates that utilizing normal forms could facilitate the data-efficient and scalable discovery of hidden variables controlling transitions between different dynamical regimes and non-linear models.
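    A minimal way to see how a normal form exposes a hidden control variable is sketched below: the saddle-node normal form dx/dt = r(t) + x^2 is simulated with a slowly drifting r(t), and r(t) is then recovered from the trajectory by inverting the normal form with a finite-difference derivative. IHCV itself is derivative-free and far more general; this is only a toy illustration of the idea, with all parameter values chosen arbitrarily.

```python
# Simplified sketch of recovering a hidden, slowly drifting control variable
# from a trajectory of the saddle-node normal form  dx/dt = r(t) + x^2.
# IHCV itself is derivative-free; this finite-difference version only
# illustrates that the normal form exposes r(t) once x(t) is observed.
import numpy as np
from scipy.integrate import solve_ivp

r = lambda t: -1.0 + 0.02 * t           # hidden control variable, drifting upward

def saddle_node(t, x):
    return [r(t) + x[0] ** 2]

t_eval = np.linspace(0, 40, 2000)
sol = solve_ivp(saddle_node, (0, 40), [-1.2], t_eval=t_eval, rtol=1e-8)
x = sol.y[0]

# Estimate dx/dt numerically, then invert the normal form: r_hat = dx/dt - x^2.
dxdt = np.gradient(x, t_eval)
r_hat = dxdt - x ** 2

err = np.max(np.abs(r_hat[50:-50] - r(t_eval[50:-50])))
print(f"max recovery error (interior points): {err:.3e}")
```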

    Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients

    Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS are associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stages. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significantly correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small-sized clinical materials.
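    The core of a permutation-based non-parametric combination (NPC) can be shown in a few lines: partial test statistics are computed for each data type, case/control labels are permuted jointly so the dependence between data types is preserved, partial p-values are taken from the joint permutation distribution, and a combining function (Fisher's, here) yields a single global p-value. The sketch below uses synthetic paired expression/methylation values for one gene; the effect sizes and choice of statistics are arbitrary illustrations, not the study's pipeline.

```python
# Minimal sketch of permutation-based non-parametric combination (NPC) for
# one gene measured with two paired data types. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_controls, n_perm = 30, 30, 2000
labels = np.array([1] * n_cases + [0] * n_controls)

# Paired measurements: expression shifted in cases, methylation correlated
# with expression (a shared, weak signal).
expression = rng.normal(0, 1, labels.size) + 0.5 * labels
methylation = -0.4 * expression + rng.normal(0, 1, labels.size)

def partial_stats(perm_labels):
    """Absolute case-control mean differences for each data type."""
    case, ctrl = perm_labels == 1, perm_labels == 0
    return np.array([abs(expression[case].mean() - expression[ctrl].mean()),
                     abs(methylation[case].mean() - methylation[ctrl].mean())])

observed = partial_stats(labels)
perm = np.array([partial_stats(rng.permutation(labels)) for _ in range(n_perm)])
all_stats = np.vstack([observed, perm])               # row 0 = observed statistics

# Partial p-values: rank of each statistic within the joint permutation set.
partial_p = (all_stats[None, :, :] >= all_stats[:, None, :]).sum(axis=1) / all_stats.shape[0]
combined = (-2 * np.log(partial_p)).sum(axis=1)       # Fisher combining function
global_p = (combined[1:] >= combined[0]).mean()       # NPC global p-value

print("partial p-values (expression, methylation):", partial_p[0].round(4))
print(f"NPC combined p-value: {global_p:.4f}")
```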