Pulmonologists-Level lung cancer detection based on standard blood test results and smoking status using an explainable machine learning approach
Lung cancer (LC) remains the primary cause of cancer-related mortality,
largely due to late-stage diagnoses. Effective strategies for early detection
are therefore of paramount importance. In recent years, machine learning (ML)
has demonstrated considerable potential in healthcare by facilitating the
detection of various diseases. In this retrospective development and validation
study, we developed an ML model based on dynamic ensemble selection (DES) for
LC detection. The model leverages standard blood sample analysis and smoking
history data from a large population at risk in Denmark. The study includes all
patients examined on suspicion of LC in the Region of Southern Denmark from
2009 to 2018. We validated and compared the predictions by the DES model with
diagnoses provided by five pulmonologists. Among the 38,944 patients, 9,940 had
complete data, of which 2,505 (25%) had LC. The DES model achieved an area
under the ROC curve of 0.77 ± 0.01, sensitivity of 76.2% ± 2.4%,
specificity of 63.8% ± 2.3%, positive predictive value of 41.6% ± 1.2%,
and F1-score of 53.8% ± 1.1%. The DES model outperformed
all five pulmonologists, achieving a sensitivity 9\% higher than their average.
The model identified smoking status, age, total calcium levels, neutrophil
count, and lactate dehydrogenase as the most important factors for the
detection of LC. The results highlight the successful application of the ML
approach in detecting LC, surpassing pulmonologists' performance. Incorporating
clinical and laboratory data in future risk assessment models can improve
decision-making and facilitate timely referrals.
Comment: 9 pages, 4 figures
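The reported metrics are linked by simple identities, which makes a quick consistency check possible. The sketch below recomputes the F1-score from the reported positive predictive value and sensitivity, and recomputes the positive predictive value from sensitivity, specificity, and the prevalence among patients with complete data. It uses only the point estimates quoted in the abstract and ignores the reported uncertainty, so agreement can only be approximate; the function names are illustrative and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)


def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Point estimates quoted in the abstract (uncertainty ignored).
sensitivity = 0.762
specificity = 0.638
ppv_reported = 0.416
prevalence = 2505 / 9940  # LC cases among patients with complete data

f1 = f1_score(ppv_reported, sensitivity)
ppv = ppv_from_rates(sensitivity, specificity, prevalence)

print(f"F1 from PPV and sensitivity: {f1:.3f}")
print(f"PPV from sensitivity, specificity, prevalence: {ppv:.3f}")
```

Both recomputed values land within rounding distance of the reported 53.8% and 41.6%, which suggests the quoted figures are mutually consistent.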
Serum vitamin K1 associated to microangiopathy and/or macroangiopathy in individuals with and without diabetes
Objective: Vitamin K has proposed beneficial effects on cardiovascular health. We investigated whether serum vitamin K1 was associated with the prevalence of microangiopathy and/or macroangiopathy. Research design and methods: Serum vitamin K was quantified in 3239 individuals with and 3808 without diabetes enrolled in Vejle Diabetes Biobank (2007–2010). Each individual was assessed for microangiopathy and macroangiopathy at enrollment based on registered diagnoses in the Danish National Patient Registry according to the International Classification of Diseases, 8th revision (1977–1993) and 10th revision (since 1994). Using multinomial logistic regression, relative risk ratios (RRRs) were calculated separately for individuals with and without diabetes. RRRs were estimated for microangiopathic/macroangiopathic status, compared with individuals without complications, as a function of 1 nmol/L increments in K1. Adjustment for potential confounders was also performed. Results: Median vitamin K1 ranged from 0.86 to 0.95 nmol/L depending on diabetes, microangiopathic and macroangiopathic status. In individuals with diabetes, the crude RRR for having only microangiopathy was 1.05 (95% CI 0.98 to 1.12) and became significant after adjustment, at 1.10 (95% CI 1.01 to 1.19). The RRR for having only macroangiopathy was 0.89 (95% CI 0.77 to 1.03) and likewise became significant after adjustment, at 0.79 (95% CI 0.66 to 0.96). In individuals without diabetes, adjustment again led to similar estimates, which were not significant. The adjusted RRR for having only macroangiopathy was 1.08 (95% CI 0.98 to 1.19). Conclusions: Serum vitamin K1 levels were associated with microangiopathic and macroangiopathic status in individuals with diabetes, but the association was considered of no clinical relevance. The clinical value of other candidate markers for vitamin K status needs to be evaluated in future studies.
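The significance statements above follow from whether each 95% confidence interval excludes the null value of 1. A minimal sketch of that reading is given below, using the estimates quoted in the abstract. The helper names are hypothetical, and `rescale_rrr` assumes the log-linear effect implied by a logistic model; it is included only to show how a per-1 nmol/L ratio translates to other increments.

```python
import math


def is_significant(ci_low: float, ci_high: float, null_value: float = 1.0) -> bool:
    """A ratio is significant at the CI's level if the interval excludes the null."""
    return not (ci_low <= null_value <= ci_high)


def rescale_rrr(rrr: float, increment: float) -> float:
    """RRR for a different increment, assuming a log-linear effect."""
    return math.exp(math.log(rrr) * increment)


# (RRR, lower 95% CI, upper 95% CI) as quoted in the abstract.
estimates = {
    "diabetes, microangiopathy only, crude": (1.05, 0.98, 1.12),
    "diabetes, microangiopathy only, adjusted": (1.10, 1.01, 1.19),
    "diabetes, macroangiopathy only, crude": (0.89, 0.77, 1.03),
    "diabetes, macroangiopathy only, adjusted": (0.79, 0.66, 0.96),
    "no diabetes, macroangiopathy only, adjusted": (1.08, 0.98, 1.19),
}

for label, (rrr, low, high) in estimates.items():
    verdict = "significant" if is_significant(low, high) else "not significant"
    print(f"{label}: RRR {rrr:.2f} ({verdict})")
```

Under the log-linear assumption, an RRR of 1.10 per 1 nmol/L corresponds to about 1.21 per 2 nmol/L, since the ratios multiply rather than add.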
Quantification of microRNA levels in plasma - Impact of preanalytical and analytical conditions.
Numerous studies have reported a potential role for circulating microRNAs as biomarkers in a wide variety of diseases. However, there is a critical reproducibility challenge, part of which might be due to differences in preanalytical and/or analytical factors. Thus, in the current study we systematically investigated the impact of selected preanalytical and analytical variables on the measured microRNA levels in plasma. Similar levels of microRNA were found in platelet-poor plasma obtained by dual compared to prolonged single centrifugation. In contrast, poor correlation was observed between measurements in standard plasma compared to platelet-poor plasma. The correlation between quantitative real-time PCR and droplet digital PCR was found to be good, contrary to TaqMan Low Density Array and single TaqMan assays, where no correlation could be demonstrated. Depending on the specific microRNA measured and the normalization strategy used, the intra- and inter-assay variation of quantitative real-time PCR were found to be 4.2-6.8% and 10.5-31.4%, respectively. Using droplet digital PCR, the intra-assay variation was 4.4-20.1% and the inter-assay variation 5.7-26.7%. Plasma preparation and microRNA purification were found to account for 39-73% of the total intra-assay variation, depending on the microRNA measured and the normalization strategy used. In conclusion, our study highlights the importance of reporting comprehensive methodological information when publishing, allowing others to perform validation studies in which preanalytical and analytical variables as causes of divergent results can be minimized. Furthermore, if microRNAs are to become routinely used diagnostic or prognostic biomarkers, the differences in plasma microRNA levels between healthy and diseased subjects must exceed the high preanalytical and analytical variability.
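Intra- and inter-assay variation of this kind is commonly expressed as a coefficient of variation (CV). The sketch below shows one way such figures could be computed; the replicate values are hypothetical, and both the choice of scale and the variance decomposition are assumptions rather than the study's documented procedure. In particular, `share_of_variance` assumes that the variance contributions of independent steps add.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation in percent: sample SD divided by the mean."""
    return 100 * stdev(values) / mean(values)


def share_of_variance(cv_total, cv_step):
    """Percent of total variance NOT explained by one step.

    Assumes independent steps whose variances add, and CVs on a common mean.
    """
    return 100 * (1 - (cv_step / cv_total) ** 2)


# Hypothetical microRNA levels: replicates within each of three runs.
runs = {
    "run_1": [10.2, 9.8, 10.5, 10.1],
    "run_2": [11.4, 11.0, 11.9, 11.3],
    "run_3": [9.1, 9.6, 9.0, 9.4],
}

# Intra-assay: average of the within-run CVs.
intra_assay_cv = mean(cv_percent(v) for v in runs.values())
# Inter-assay: CV of the run means.
inter_assay_cv = cv_percent([mean(v) for v in runs.values()])

print(f"intra-assay CV: {intra_assay_cv:.1f}%")
print(f"inter-assay CV: {inter_assay_cv:.1f}%")
```

For example, if the total intra-assay CV were 10% and the PCR step alone contributed a CV of 6%, `share_of_variance(10.0, 6.0)` would attribute 64% of the variance to the remaining steps under these assumptions.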
Identification of patients’ smoking status using an explainable AI approach: a Danish electronic health records case study
Background: Smoking is a critical risk factor responsible for over eight million annual deaths worldwide. It is essential to obtain information on smoking habits to advance research and implement preventive measures such as screening of high-risk individuals. In most countries, including Denmark, smoking habits are not systematically recorded and are at best documented within unstructured free-text segments of electronic health records (EHRs). Researchers and clinicians would therefore have to navigate manually through extensive amounts of unstructured data, which is one of the main reasons that smoking habits are rarely integrated into larger studies. Our aim is to develop machine learning models to classify patients’ smoking status from their EHRs. Methods: This study proposes an efficient natural language processing (NLP) pipeline capable of classifying patients’ smoking status and providing explanations for the decisions. The proposed NLP pipeline comprises four distinct components: (1) preprocessing techniques to address abbreviations, punctuation, and other textual irregularities, (2) four cutting-edge feature extraction techniques, i.e. Embedding, BERT, Word2Vec, and Count Vectorizer, employed to extract the optimal features, (3) utilization of a Stacking-based Ensemble (SE) model and a Convolutional Long Short-Term Memory Neural Network (CNN-LSTM) for the identification of smoking status, and (4) application of a local interpretable model-agnostic explanation to explain the decisions rendered by the detection models. The EHRs of 23,132 patients with suspected lung cancer were collected from the Region of Southern Denmark during the period 1/1/2009-31/12/2018. A medical professional annotated the data into ‘Smoker’ and ‘Non-Smoker’ with further classifications as ‘Active-Smoker’, ‘Former-Smoker’, and ‘Never-Smoker’. Subsequently, the annotated dataset was used for the development of binary and multiclass classification models.
An extensive comparison was conducted of the detection performance across various model architectures. Results: The results of experimental validation confirm the consistency among the models. However, for binary classification, the BERT method with the CNN-LSTM architecture outperformed other models by achieving precision, recall, and F1-scores between 97% and 99% for both Never-Smokers and Active-Smokers. In multiclass classification, the Embedding technique with the CNN-LSTM architecture yielded the most favorable results in class-specific evaluations, with equal performance measures of 97% for Never-Smoker and measures in the range of 86 to 89% for Active-Smoker and 91–92% for Never-Smoker. Conclusion: Our proposed NLP pipeline achieved a high level of classification performance. In addition, we presented the explanation of the decision made by the best performing detection model. Future work will expand the model’s capabilities to analyze longer notes and a broader range of categories to maximize its utility in further research and screening applications.
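The first two pipeline components, preprocessing and count-based feature extraction, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the abbreviation map, the English example notes, and all function names are hypothetical, and the study's records are Danish free text processed with far richer feature extractors.

```python
import re
from collections import Counter

# Hypothetical abbreviation map; a real one would be built from the EHR corpus.
ABBREVIATIONS = {"cig": "cigarettes", "yrs": "years", "pt": "patient"}


def preprocess(note: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, tokenize, expand abbreviations."""
    text = re.sub(r"[^\w\s]", " ", note.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in text.split()]


def fit_vocabulary(token_lists) -> dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def count_vector(tokens, vocabulary) -> list[int]:
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokens).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical notes standing in for free-text EHR segments.
notes = ["Pt. smokes 20 cig/day.", "Never smoked.", "Stopped smoking 5 yrs ago."]
tokenized = [preprocess(n) for n in notes]
vocabulary = fit_vocabulary(tokenized)
features = [count_vector(t, vocabulary) for t in tokenized]

print(tokenized[0])
print(features[0])
```

The resulting count vectors are the kind of input a downstream classifier would consume; the stacking ensemble, CNN-LSTM, and explanation step described above are outside the scope of this sketch.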
Overview of experiments.
Experiments outlined in A) were used to compare dual and prolonged single centrifugation (experiment 1) and to compare qPCR and ddPCR with respect to precision and repeatability (part of experiment 4). Experiments outlined in B) were used to investigate the correlation between qPCR (single assays) and ddPCR (part of experiment 4), correlation between single TaqMan assays and TaqMan Low Density Array (experiment 3) and correlation between TaqMan assays performed with microRNA purified from standard plasma and PPP, respectively (experiment 2).