20 research outputs found

    Improving the repeatability of deep learning models with Monte Carlo dropout

    The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Repeatable models output predictions with low variation during independent tests carried out under similar conditions. During model development and evaluation, much attention is given to classification performance while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images that were acquired from the same patient during the same visit. We study the performance of binary, multi-class, ordinal, and regression models on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increased repeatability for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16 percentage points and of the disagreement rate by 7 percentage points. The classification accuracy improved in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions were better calibrated, so the output probabilities more accurately reflect the true likelihood of being correctly classified. Comment: arXiv admin note: text overlap with arXiv:2111.0675
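As a rough illustration of the test-time sampling described in this abstract, the sketch below averages class probabilities over repeated stochastic forward passes and computes a disagreement rate between two sets of predictions. It is not the authors' implementation: the toy linear model, the function names, the dropout probability of 0.5, and the definition of disagreement rate as the fraction of test-retest pairs with different predicted classes are all assumptions made for illustration. Only the default of 20 iterations comes from the abstract.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def stochastic_forward(x, weights, rng, p_drop=0.5):
    # Hypothetical stand-in for a network whose dropout stays active at test time:
    # a random mask zeroes input features, and the survivors are rescaled.
    mask = rng.random(x.shape) >= p_drop
    return softmax((x * mask / (1.0 - p_drop)) @ weights)


def mc_dropout_predict(x, weights, rng, n_iter=20):
    # Average the class probabilities over n_iter stochastic forward passes.
    probs = np.stack([stochastic_forward(x, weights, rng) for _ in range(n_iter)])
    return probs.mean(axis=0)


def disagreement_rate(labels_a, labels_b):
    # Assumed definition: fraction of test-retest pairs whose predicted classes differ.
    return float(np.mean(np.asarray(labels_a) != np.asarray(labels_b)))


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 3))   # 8 features, 3 classes (made-up sizes)
images = rng.normal(size=(4, 8))    # 4 synthetic "images"
mean_probs = mc_dropout_predict(images, weights, rng)
predicted = mean_probs.argmax(axis=1)
```

Averaging probabilities rather than class votes keeps the output usable for calibration checks; a real experiment would replace the toy model with the trained ResNet or DenseNet.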

    Comparison of Accuracy and Reproducibility of Colposcopic Impression Based on a Single Image versus a Two-Minute Time Series of Colposcopic Images

    OBJECTIVE: Colposcopy is an important part of cervical screening/management programs. Colposcopic appearance is often classified, for teaching and telemedicine, based on static images that do not reveal the dynamics of acetowhitening. We compared the accuracy and reproducibility of colposcopic impression based on a single image at one minute after application of acetic acid versus a time-series of 17 sequential images over two minutes. METHODS: Approximately 5000 colposcopic examinations conducted with the DYSIS colposcopic system were divided into 10 random sets, each assigned to a separate expert colposcopist. Colposcopists first classified single two-dimensional images at one minute and then a time-series of 17 sequential images as 'normal,' 'indeterminate,' 'high grade,' or 'cancer'. Ratings were compared to histologic diagnoses. Additionally, 5 colposcopists reviewed a subset of 200 single images and 200 time series to estimate intra- and inter-rater reliability. RESULTS: Of 4640 patients with adequate images, only 24.4% were correctly categorized by single image visual assessment (11% of 64 cancers; 31% of 605 CIN3; 22.4% of 558 CIN2; 23.9% of 3412 < CIN2). Individual colposcopist accuracy was low; Youden indices (sensitivity plus specificity minus one) ranged from 0.07 to 0.24. Use of the time-series increased the proportion of images classified as normal, regardless of histology. Intra-rater reliability was substantial (weighted kappa = 0.64); inter-rater reliability was fair (weighted kappa = 0.26). CONCLUSION: Substantial variation exists in visual assessment of colposcopic images, even when a 17-image time series showing the two-minute process of acetowhitening is presented. We are currently evaluating whether deep-learning image evaluation can assist classification.
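The Youden index mentioned in this abstract (sensitivity plus specificity minus one) can be computed from a 2x2 confusion table, as in the minimal sketch below. The function name and the counts in the usage line are hypothetical and are not data from the study.

```python
def youden_index(tp, fn, tn, fp):
    # Sensitivity: share of true positives among all diseased cases.
    sensitivity = tp / (tp + fn)
    # Specificity: share of true negatives among all non-diseased cases.
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical counts: 30% sensitivity and 80% specificity.
j = youden_index(tp=30, fn=70, tn=80, fp=20)
```

A value of 0 means the rater does no better than chance and 1 means perfect discrimination, which puts the reported range of 0.07 to 0.24 close to the chance end of the scale.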

    Validation in Zambia of a cervical screening strategy including HPV genotyping and artificial intelligence (AI)-based automated visual evaluation

    Background: WHO has recommended HPV testing for cervical screening where it is practical and affordable. If used, it is important to both clarify and implement the clinical management of positive results. We estimated the performance in Lusaka, Zambia of a novel screening/triage approach combining HPV typing with visual assessment assisted by a deep-learning approach called automated visual evaluation (AVE). Methods: In this well-established cervical cancer screening program nested inside public sector primary care health facilities, experienced nurses examined women with high-quality digital cameras; the magnified illuminated images permit inspection of the surface morphology of the cervix and expert telemedicine quality assurance. Emphasizing sensitive criteria to avoid missing precancer/cancer, ~25% of women screen positive, reflecting partly the high HIV prevalence. Visual screen-positive women are treated in the same visit by trained nurses using either ablation (~60%) or LLETZ excision, or referred for LLETZ or more extensive surgery as needed. We added research elements (which did not influence clinical care) including collection of HPV specimens for testing and typing with BD Onclarity™ with a five channel output (HPV16, HPV18/45, HPV31/33/52/58, HPV35/39/51/56/59/66/68, human DNA control), and collection of triplicate cervical images with a Samsung Galaxy J8 smartphone camera™ that were analyzed using AVE, an AI-based algorithm pre-trained on a large NCI cervical image archive. The four HPV groups and three AVE classes were crossed to create a 12-level risk scale, ranking participants in order of predicted risk of precancer. We evaluated the risk scale and assessed how well it predicted the observed diagnosis of precancer/cancer. Results: HPV type, AVE classification, and the 12-level risk scale all were strongly associated with degree of histologic outcome. The AVE classification showed good reproducibility between replicates, and added finer predictive accuracy to each HPV type group. Women living with HIV had higher prevalence of precancer/cancer; the HPV-AVE risk categories strongly predicted diagnostic findings in these women as well. Conclusions: These results support the theoretical efficacy of HPV-AVE-based risk estimation for cervical screening. If HPV testing can be made affordable, cost-effective and point of care, this risk-based approach could be one management option for HPV-positive women.