
    Clinically focused multi-cohort benchmarking as a tool for external validation of artificial intelligence algorithm performance in basic chest radiography analysis

    Artificial intelligence (AI) algorithms evaluating [supine] chest radiographs ([S]CXRs) have remarkably increased in number recently. Since training and validation are often performed on subsets of the same overall dataset, external validation is mandatory to reproduce results and reveal potential training errors. We applied multi-cohort benchmarking to the publicly accessible (S)CXR-analyzing AI algorithm CheXNet, comprising three clinically relevant study cohorts that differ in patient positioning ([S]CXRs), the applied reference standards (CT-/[S]CXR-based) and the possibility to also compare algorithm classification with different medical experts’ reading performance. The study cohorts include [1] a cohort of 563 CXRs acquired in the emergency unit that were evaluated by 9 readers (radiologists and non-radiologists) in terms of 4 common pathologies, [2] a collection of 6,248 SCXRs annotated by radiologists in terms of pneumothorax presence, its size and the presence of inserted thoracic tube material, which allowed for subgroup and confounding bias analysis, and [3] a cohort consisting of 166 patients with SCXRs that were evaluated by radiologists for underlying causes of basal lung opacities, all of those cases having been correlated to a timely acquired computed tomography scan (SCXR and CT within < 90 min). CheXNet non-significantly exceeded the radiology resident (RR) consensus in the detection of suspicious lung nodules (cohort [1], AUC AI/RR: 0.851/0.839, p = 0.793) and the radiological readers in the detection of basal pneumonia (cohort [3], AUC AI/reader consensus: 0.825/0.782, p = 0.390) and basal pleural effusion (cohort [3], AUC AI/reader consensus: 0.762/0.710, p = 0.336) in SCXR, partly with AUC values higher than originally published (“Nodule”: 0.780, “Infiltration”: 0.735, “Effusion”: 0.864). The classifier “Infiltration” turned out to be very dependent on patient positioning (best in CXR, worst in SCXR).
The pneumothorax SCXR cohort [2] revealed poor algorithm performance in CXRs without inserted thoracic material and in the detection of small pneumothoraces, which can be explained by a known systematic confounding error in the algorithm training process. The benefit of clinically relevant external validation is demonstrated by the differences in algorithm performance as compared to the original publication. Finally, our multi-cohort benchmarking makes it possible to account for confounders, different reference standards and patient positioning, and to compare AI performance with medical readers of differing qualifications.
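
The subgroup comparison described in this abstract can be illustrated with a minimal sketch. It computes the AUC by pairwise comparison (the Mann-Whitney formulation) and then repeats the calculation per subgroup, which is one way a confounder such as inserted tube material could surface. The function names, the subgroup labels and the toy data are hypothetical and are not taken from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for cases with the finding, 0 for cases without it.
    scores: classifier outputs, higher meaning more suspicious.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def auc_by_subgroup(labels, scores, subgroups):
    """Compute the AUC separately for each subgroup label."""
    result = {}
    for group in sorted(set(subgroups)):
        idx = [i for i, g in enumerate(subgroups) if g == group]
        result[group] = auc([labels[i] for i in idx], [scores[i] for i in idx])
    return result


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.4, 0.3, 0.5, 0.2]
tube = ["tube", "tube", "tube", "tube", "none", "none", "none", "none"]

print(auc(labels, scores))
print(auc_by_subgroup(labels, scores, tube))
```

In this toy data the pooled AUC looks acceptable while the subgroup without tube material performs at chance level, which is the kind of pattern a pooled metric alone would hide.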

    On Interpretability of Deep Learning based Skin Lesion Classifiers using Concept Activation Vectors

    Deep learning based medical image classifiers have shown remarkable prowess in various application areas like ophthalmology, dermatology, pathology, and radiology. However, the acceptance of these Computer-Aided Diagnosis (CAD) systems in real clinical setups is severely limited, primarily because their decision-making process remains largely obscure. This work aims at elucidating a deep learning based medical image classifier by verifying that the model learns and utilizes similar disease-related concepts as described and employed by dermatologists. We used a well-trained and high performing neural network developed by REasoning for COmplex Data (RECOD) Lab for classification of three skin tumours, i.e. Melanocytic Naevi, Melanoma and Seborrheic Keratosis, and performed a detailed analysis of its latent space. Two well established and publicly available skin disease datasets, PH2 and derm7pt, are used for experimentation. Human understandable concepts are mapped to the RECOD image classification model with the help of Concept Activation Vectors (CAVs), introducing a novel training and significance testing paradigm for CAVs. Our results on an independent evaluation set clearly show that the classifier learns and encodes human understandable concepts in its latent representation. Additionally, TCAV scores (Testing with CAVs) suggest that the neural network indeed makes use of disease-related concepts in the correct way when making predictions. We anticipate that this work can not only increase the confidence of medical practitioners in CAD but also serve as a stepping stone for further development of CAV-based neural network interpretation methods. Comment: Accepted for the IEEE International Joint Conference on Neural Networks (IJCNN) 202

    Deep learning approach for cardiovascular disease risk stratification and survival analysis on a Canadian cohort

    The quantification of carotid plaque has been routinely used to predict cardiovascular risk in cardiovascular disease (CVD) and coronary artery disease (CAD). This study aimed to determine how well carotid plaque features predict the likelihood of CAD and cardiovascular (CV) events using deep learning (DL), compared against the machine learning (ML) paradigm. The participants in this study consisted of 459 individuals who had undergone coronary angiography, contrast-enhanced ultrasonography, and focused carotid B-mode ultrasound. Each patient was tracked for thirty days. The measurements on these patients consisted of maximum plaque height (MPH), total plaque area (TPA), carotid intima-media thickness (cIMT), and intraplaque neovascularization (IPN). CAD risk and CV event stratification were performed by applying eight types of DL-based models. Univariate and multivariate analyses were also conducted to identify the most significant risk predictors. The effectiveness of the DL models was evaluated by the area-under-the-curve measurement, while the CV event prediction was evaluated using the Cox proportional hazard model (CPHM) and compared against the DL-based concordance index (c-index). IPN showed a substantial ability to predict CV events (p < 0.0001). The best DL system improved by 21% (0.929 vs. 0.762) over the best ML system. DL-based CV event prediction showed a ~17% increase in the DL-based c-index compared to the CPHM (0.86 vs. 0.73). CAD and CV incidents were linked to IPN and carotid imaging characteristics. For survival analysis and CAD prediction, the DL-based system outperforms ML-based models.
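
The concordance index mentioned in this abstract can be sketched as follows. This is a minimal illustration of Harrell's c-index for right-censored data, not the authors' implementation; the function name and the toy follow-up data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored survival data.

    times:  follow-up time per patient.
    events: 1 if the event was observed, 0 if the patient was censored.
    risks:  predicted risk score, higher meaning an earlier expected event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data, not taken from the study.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by risk, so the reported 0.86 versus 0.73 compares how well each model orders patients by time to event.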

    Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology.

    Artificial intelligence (AI) can extract visual information from histopathological slides and yield biological insight and clinical biomarkers. Whole slide images are cut into thousands of tiles and classification problems are often weakly-supervised: the ground truth is only known for the slide, not for every single tile. In classical weakly-supervised analysis pipelines, all tiles inherit the slide label, while in multiple-instance learning (MIL), only bags of tiles inherit the label. However, it is still unclear how these widely used but markedly different approaches perform relative to each other. We implemented and systematically compared six methods in six clinically relevant end-to-end prediction tasks using data from N=2980 patients for training with rigorous external validation. We tested three classical weakly-supervised approaches with convolutional neural networks and vision transformers (ViT) and three MIL-based approaches with and without an additional attention module. Our results empirically demonstrate that histological tumor subtyping of renal cell carcinoma is an easy task in which all approaches achieve an area under the receiver operating characteristic curve (AUROC) above 0.9. In contrast, we report significant performance differences for clinically relevant tasks of mutation prediction in colorectal, gastric, and bladder cancer. In these mutation prediction tasks, classical weakly-supervised workflows outperformed MIL-based weakly-supervised methods, which is surprising given their simplicity. This shows that new end-to-end image analysis pipelines in computational pathology should be compared to classical weakly-supervised methods. Also, these findings motivate the development of new methods which combine the elegant assumptions of MIL with the empirically observed higher performance of classical weakly-supervised approaches.
We make all source code publicly available at https://github.com/KatherLab/HIA, allowing easy application of all methods to any similar task.
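
The difference between the two families of approaches can be sketched at the aggregation step. In the sketch below, the classical pipeline averages per-tile scores into a slide score, while the attention-based MIL variant pools tile feature vectors with softmax attention weights before a slide-level classifier would be applied. Averaging as the classical aggregation rule, the function names and the inputs are assumptions for illustration; in a real model the attention logits come from a learned network, and none of this is taken from the linked repository.

```python
import math


def classical_slide_score(tile_scores):
    """Classical weak supervision: every tile was trained with the slide
    label, and the slide score is the mean of the per-tile scores."""
    return sum(tile_scores) / len(tile_scores)


def attention_pool(tile_features, attention_logits):
    """Attention-based MIL pooling: one feature vector per bag of tiles.

    attention_logits would come from a learned attention module; here they
    are passed in directly (a hypothetical input).
    """
    peak = max(attention_logits)
    exps = [math.exp(a - peak) for a in attention_logits]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tile_features[0])
    pooled = [
        sum(w * feats[d] for w, feats in zip(weights, tile_features))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical toy inputs.
print(classical_slide_score([0.2, 0.4, 0.9]))
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
print(pooled, weights)
```

With equal attention logits the pooled vector reduces to the plain mean of the tile features, so the attention module only matters when it learns to weight some tiles more than others.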

    DermAI 1.0: A Robust, Generalized, and Novel Attention-Enabled Ensemble-Based Transfer Learning Paradigm for Multiclass Classification of Skin Lesion Images

    Skin lesion classification plays a crucial role in dermatology, aiding in the early detection, diagnosis, and management of life-threatening malignant lesions. However, standalone transfer learning (TL) models have failed to deliver optimal performance. In this study, we present an attention-enabled ensemble-based deep learning technique, a powerful, novel, and generalized method for extracting features for the classification of skin lesions. This technique holds significant promise in enhancing diagnostic accuracy by using seven pre-trained TL models for classification. Six ensemble-based DL (EBDL) models were created using stacking, softmax voting, and weighted average techniques. Furthermore, we investigated the attention mechanism as an effective paradigm and created seven attention-enabled transfer learning (aeTL) models before branching out to construct three attention-enabled ensemble-based DL (aeEBDL) models to create a reliable, adaptive, and generalized paradigm. The mean accuracy of the TL models is 95.30%, and the use of an ensemble-based paradigm increased it by 4.22%, to 99.52%. The aeTL models outperformed the TL models in accuracy by 3.01%, and the aeEBDL models outperformed the aeTL models by 1.29%. Statistical tests show a significant p-value and Kappa coefficient along with a 99.6% reliability index for the aeEBDL models. The approach is highly effective and generalized for the classification of skin lesions.
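
Two of the ensembling techniques named in this abstract can be sketched in a few lines. The sketch assumes that "softmax voting" means averaging the softmax probabilities of the base models and that the weighted average uses one weight per model; both readings, the function name and the toy probabilities are assumptions, not details confirmed by the abstract.

```python
def soft_vote(model_probs, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    model_probs: one probability vector per base model.
    weights:     optional per-model weights; equal weights if omitted.
    Returns the combined probabilities and the index of the winning class.
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_classes = len(model_probs[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return combined, predicted


# Hypothetical softmax outputs of three base models for one image.
probs = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.6, 0.2]]
print(soft_vote(probs))                     # equal weights
print(soft_vote(probs, weights=[3, 1, 1]))  # first model trusted more
```

In this toy case equal weighting selects class 1, while up-weighting the first model flips the decision to class 0, which shows why the choice of weights is itself something to validate.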

    Predictive analytics framework for electronic health records with machine learning advancements : optimising hospital resources utilisation with predictive and epidemiological models

    The primary aim of this thesis was to investigate the feasibility and robustness of predictive machine-learning models for improving hospital resource utilisation with data-driven approaches and for predicting hospitalisation using hospital quality assessment metrics such as length of stay. The length of stay predictions include validating the proposed predictive framework on each hospital’s electronic health records data source. In this thesis, we relied on electronic health records (EHRs) to build a data-driven predictive inpatient length of stay (LOS) research framework that suits the most demanding hospital facilities in the context of hospital resource utilisation. The thesis focused on the viability of the predictive length of stay approaches in dynamic and demanding healthcare facilities and hospital settings such as intensive care units and emergency departments. While the hospital length of stay predictions are an (internal) assessment of inpatient outcomes from the time of admission to discharge, the thesis also considered (external) factors outside hospital control, such as forecasting future hospitalisations from the spread of infectious communicable disease during pandemics. The internal and external splits are the thesis’ main contributions. Therefore, the thesis evaluated public health measures during events of uncertainty (e.g. pandemics) and measured the effect of non-pharmaceutical interventions during outbreaks on future hospitalised cases. To the best of our knowledge, this approach is the first contribution in the literature to examine the effect of epidemiological curves using simulation models to project future hospitalisations, given their strong potential to impact hospital bed availability and to stress hospital workflow and workers.
The main research commonality between chapters is the usefulness of ensemble learning models in the context of LOS for hospital resource utilisation. Ensemble learning models aim for better predictive performance by combining several base models to produce an optimal predictive model. These predictive models explored the internal LOS for various chronic and acute conditions using data-driven approaches to determine the most accurate and powerful predicted outcomes. This eventually helps to achieve desired outcomes for hospital professionals who are working in hospital settings.

    A Performance-Explainability-Fairness Framework For Benchmarking ML Models

    Machine learning (ML) models have achieved remarkable success in various applications; however, ensuring their robustness and fairness remains a critical challenge. In this research, we present a comprehensive framework designed to evaluate and benchmark ML models through the lenses of performance, explainability, and fairness. This framework addresses the increasing need for a holistic assessment of ML models, considering not only their predictive power but also their interpretability and equitable deployment. The proposed framework leverages a multi-faceted evaluation approach, integrating performance metrics with explainability and fairness assessments. Performance evaluation incorporates standard measures such as accuracy, precision, and recall, and extends to the overall balanced error rate and the overall area under the receiver operating characteristic (ROC) curve (AUC) to capture model behavior across different performance aspects. Explainability assessment employs state-of-the-art techniques to quantify the interpretability of model decisions, ensuring that model behavior can be understood and trusted by stakeholders. The fairness evaluation examines model predictions in terms of demographic parity and equalized odds, thereby addressing concerns of bias and discrimination in the deployment of ML systems. To demonstrate the practical utility of the framework, we apply it to a diverse set of ML algorithms across various functional domains, including finance, criminology, education, and healthcare prediction. The results showcase the importance of a balanced evaluation approach, revealing trade-offs between performance, explainability, and fairness that can inform model selection and deployment decisions. Furthermore, we provide insights into the analysis of trade-offs in selecting the appropriate model for use cases where performance, interpretability and fairness are important.
In summary, the Performance-Explainability-Fairness Framework offers a unified methodology for evaluating and benchmarking ML models, enabling practitioners and researchers to make informed decisions about model suitability and ensuring responsible and equitable AI deployment. We believe that this framework represents a crucial step towards building trustworthy and accountable ML systems in an era where AI plays an increasingly prominent role in decision-making processes.
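
The two fairness criteria named in this abstract can be made concrete with a minimal sketch. It reports demographic parity as the largest gap in positive-prediction rate between groups, and equalized odds as the larger of the true-positive-rate gap and the false-positive-rate gap. Summarising each criterion as a single difference is one common convention and an assumption here; the function names and toy data are hypothetical and not part of the described framework.

```python
def _rate(values):
    """Fraction of 1s in a non-empty list of 0/1 values."""
    if not values:
        raise ValueError("a group has no cases for this rate")
    return sum(values) / len(values)


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [
        _rate([p for p, g in zip(y_pred, groups) if g == group])
        for group in set(groups)
    ]
    return max(rates) - min(rates)


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the TPR gap and the FPR gap between groups."""
    tprs, fprs = [], []
    for group in set(groups):
        rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
        tprs.append(_rate([p for t, p in rows if t == 1]))  # true positive rate
        fprs.append(_rate([p for t, p in rows if t == 0]))  # false positive rate
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Hypothetical toy data with two groups, "a" and "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(y_pred, groups))
print(equalized_odds_difference(y_true, y_pred, groups))
```

A value of 0 means the groups are treated identically under that criterion; larger values indicate a bigger disparity that would have to be weighed against performance and explainability.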

    COVLIAS 2.0-cXAI: Cloud-Based Explainable Deep Learning System for COVID-19 Lesion Localization in Computed Tomography Scans

    The previous COVID-19 lung diagnosis system lacks both scientific validation and the role of explainable artificial intelligence (AI) for understanding lesion localization. This study presents a cloud-based explainable AI, the "COVLIAS 2.0-cXAI" system using four kinds of class activation maps (CAM) models. Our cohort consisted of ~6000 CT slices from two sources (Croatia, 80 COVID-19 patients and Italy, 15 control patients). COVLIAS 2.0-cXAI design consisted of three stages: (i) automated lung segmentation using hybrid deep learning ResNet-UNet model by automatic adjustment of Hounsfield units, hyperparameter optimization, and parallel and distributed training, (ii) classification using three kinds of DenseNet (DN) models (DN-121, DN-169, DN-201), and (iii) validation using four kinds of CAM visualization techniques: gradient-weighted class activation mapping (Grad-CAM), Grad-CAM++, score-weighted CAM (Score-CAM), and FasterScore-CAM. The COVLIAS 2.0-cXAI was validated by three trained senior radiologists for its stability and reliability. The Friedman test was also performed on the scores of the three radiologists. The ResNet-UNet segmentation model resulted in dice similarity of 0.96, Jaccard index of 0.93, a correlation coefficient of 0.99, with a figure-of-merit of 95.99%, while the classifier accuracies for the three DN nets (DN-121, DN-169, and DN-201) were 98%, 98%, and 99% with a loss of ~0.003, ~0.0025, and ~0.002 using 50 epochs, respectively. The mean AUC for all three DN models was 0.99 (p < 0.0001). The COVLIAS 2.0-cXAI showed 80% scans for mean alignment index (MAI) between heatmaps and gold standard, a score of four out of five, establishing the system for clinical settings. The COVLIAS 2.0-cXAI successfully showed a cloud-based explainable AI system for lesion localization in lung CT scans.
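
The two overlap metrics reported for the segmentation stage can be sketched as follows. This is a minimal illustration on flattened binary masks, not the system's own evaluation code; the function name, the handling of two empty masks and the toy masks are assumptions.

```python
def dice_and_jaccard(pred_mask, true_mask):
    """Dice similarity and Jaccard index for two flattened binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    pred_size = sum(1 for p in pred_mask if p)
    true_size = sum(1 for t in true_mask if t)
    union = pred_size + true_size - intersection
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + true_size)
    jaccard = intersection / union
    return dice, jaccard


# Hypothetical toy masks: 1 marks a lung pixel, 0 marks background.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
print(dice_and_jaccard(predicted, reference))
```

For a single pair of masks the two metrics are linked by J = D / (2 - D), so they rank segmentations identically and differ only in scale.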
