Agreement Between Experts and an Untrained Crowd for Identifying Dermoscopic Features Using a Gamified App: Reader Feasibility Study
Background
Dermoscopy is commonly used for the evaluation of pigmented lesions, but agreement between experts for identification of dermoscopic structures is known to be relatively poor. Expert labeling of medical data is a bottleneck in the development of machine learning (ML) tools, and crowdsourcing has been demonstrated as a cost- and time-efficient method for the annotation of medical images.
Objective
The aim of this study is to demonstrate that crowdsourcing can be used to label basic dermoscopic structures from images of pigmented lesions with similar reliability to a group of experts.
Methods
First, we obtained labels of 248 images of melanocytic lesions with 31 dermoscopic “subfeatures” labeled by 20 dermoscopy experts. These were then collapsed into 6 dermoscopic “superfeatures” based on structural similarity, due to low interrater reliability (IRR): dots, globules, lines, network structures, regression structures, and vessels. These images were then used as the gold standard for the crowd study. The commercial platform DiagnosUs was used to obtain annotations from a nonexpert crowd for the presence or absence of the 6 superfeatures in each of the 248 images. We replicated this methodology with a group of 7 dermatologists to allow direct comparison with the nonexpert crowd. The Cohen κ value was used to measure agreement across raters.
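As a brief aside for readers building similar annotation pipelines: the Cohen κ statistic named above is available off the shelf. A minimal sketch, assuming scikit-learn and invented binary presence/absence labels for one superfeature (this is not the study's code):

```python
# Cohen's kappa between two raters over binary presence/absence labels.
# The label vectors below are illustrative, not taken from the study data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical presence/absence calls for one superfeature across 10 images
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen kappa: {kappa:.3f}")
```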
Results
In total, we obtained 139,731 ratings of the 6 dermoscopic superfeatures from the crowd. There was relatively lower agreement for the identification of dots and globules (the median κ values were 0.526 and 0.395, respectively), whereas network structures and vessels showed the highest agreement (the median κ values were 0.581 and 0.798, respectively). This pattern was also seen among the expert raters, who had median κ values of 0.483 and 0.517 for dots and globules, respectively, and 0.758 and 0.790 for network structures and vessels. The median κ values between nonexperts and thresholded average–expert readers were 0.709 for dots, 0.719 for globules, 0.714 for lines, 0.838 for network structures, 0.818 for regression structures, and 0.728 for vessels.
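The median κ values reported here summarize agreement over many rater pairs. A sketch of that aggregation, under the assumption that each rater contributes one binary label per image (all data invented):

```python
# Median pairwise kappa for a panel of raters: compute kappa for every
# rater pair, then take the median. Illustrative data only.
from itertools import combinations
from statistics import median
from sklearn.metrics import cohen_kappa_score

# ratings[r] = binary presence/absence labels from rater r across 8 images
ratings = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0],
]

pairwise = [cohen_kappa_score(a, b) for a, b in combinations(ratings, 2)]
print(f"median pairwise kappa: {median(pairwise):.3f}")
```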
Conclusions
This study confirmed that IRR for different dermoscopic features varied among a group of experts; a similar pattern was observed in a nonexpert crowd. There was good or excellent agreement for each of the 6 superfeatures between the crowd and the experts, highlighting the similar reliability of the crowd for labeling dermoscopic images. This confirms the feasibility and dependability of using crowdsourcing as a scalable solution to annotate large sets of dermoscopic images, with several potential clinical and educational applications, including the development of novel, explainable ML tools.
Human surface anatomy terminology for dermatology: a Delphi consensus from the International Skin Imaging Collaboration
Background: There is no internationally vetted set of anatomic terms to describe human surface anatomy. Objective: To establish expert consensus on a standardized set of terms that describe clinically relevant human surface anatomy. Methods: We conducted a Delphi consensus on surface anatomy terminology between July 2017 and July 2019. The initial survey included 385 anatomic terms, organized in seven levels of hierarchy. If agreement exceeded the 75% established threshold, the term was considered 'accepted' and included in the final list. Terms added by the participants were passed on to the next round of consensus. Terms with <75% agreement were included in subsequent surveys along with alternative terms proposed by participants until agreement was reached on all terms. Results: The Delphi included 21 participants. We found consensus (≥75% agreement) on 361/385 (93.8%) terms and eliminated one term in the first round. Of 49 new terms suggested by participants, 45 were added via consensus. To adjust for a recently published International Classification of Diseases-Surface Topography list of terms, a third survey including 111 discrepant terms was sent to participants. Finally, a total of 513 terms reached agreement via the Delphi method. Conclusions: We have established a set of 513 clinically relevant terms for denoting human surface anatomy, towards the use of standardized terminology in dermatologic documentation. Linked Commentary: R.J.G. Chalmers. J Eur Acad Dermatol Venereol 2020; 34: 2456-2457. https://doi.org/10.1111/jdv.16978.
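The round-by-round acceptance rule described above is straightforward to operationalize. Below is a minimal sketch, assuming a vote tally per term; the 75% threshold comes from the abstract, while the function name and the example tallies are invented:

```python
# Schematic Delphi round: terms with >=75% agreement are accepted; the rest
# (plus participant-proposed alternatives) carry into the next round.
# All term names and vote counts here are hypothetical.
THRESHOLD = 0.75

def delphi_round(votes: dict[str, tuple[int, int]]) -> tuple[list[str], list[str]]:
    """votes maps term -> (agree_count, total_responses)."""
    accepted, carried = [], []
    for term, (agree, total) in votes.items():
        (accepted if agree / total >= THRESHOLD else carried).append(term)
    return accepted, carried

accepted, carried = delphi_round({"scalp": (20, 21), "vertex": (14, 21)})
print(accepted, carried)  # ['scalp'] ['vertex']
```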
Dermoscopy/dermatoscopy and dermatopathology correlates of cutaneous neoplasms
Dermoscopy is increasingly used by clinicians (dermatologists, family physicians, podiatrists, doctors of osteopathic medicine, etc.) to inform clinical management decisions. Dermoscopic findings and/or images provided to pathologists offer important insight into the clinician's diagnostic and management thought process. However, because dermoscopy training in dermatopathology is limited, dermoscopic descriptions and images provided in the requisition form often hold little value for pathologists. Since most dermoscopic structures have direct histopathologic correlates, dermoscopy can act as an excellent communication bridge between the clinician and the pathologist. In the first article of this continuing medical education series, we review dermoscopic features and their histopathologic correlates.
Usefulness of dermoscopy/dermatoscopy to improve the clinical and histopathologic diagnosis of skin cancers
Multiple studies have shown that dermoscopy increases the sensitivity and specificity for the detection of skin cancers compared to naked-eye examination. Dermoscopy can also lead to the detection of thinner and smaller cancers. Furthermore, dermoscopy leads to more precise selection of lesions requiring excision. In essence, dermoscopy helps clinicians differentiate benign from malignant lesions through the presence or absence of specific dermoscopic structures. Therefore, since most dermoscopic structures have direct histopathologic correlates, dermoscopy can allow the prediction of certain histologic findings present in skin cancers, thus helping guide management and treatment options for select types of skin cancers. Visualizing dermoscopic structures in ex vivo specimens can also be beneficial. It can improve histologic diagnostic accuracy through targeted step-sectioning in areas of concern, which can be marked by the clinician before sending the specimen to the pathologist, or by the pathologist on the excised specimen in the laboratory. In addition, ex vivo dermoscopy can be used to select tumor areas with genetic importance, since some dermoscopic structures have been related to mutations with theragnostic relevance. In the second article of this continuing medical education series, we review the impact of dermoscopy on the diagnostic accuracy of skin cancer, how dermoscopy can affect the histopathologic examination, and which dermoscopic features may be most relevant for histologic and genetic prediction.
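For readers unfamiliar with the accuracy terms used here: sensitivity and specificity are computed from a 2×2 table of malignant/benign calls. A small illustration with made-up counts (not data from any of the cited studies):

```python
# Sensitivity and specificity from a 2x2 confusion matrix of
# malignant/benign calls. The counts below are invented for illustration.
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    sensitivity = tp / (tp + fn)  # malignant lesions correctly flagged
    specificity = tn / (tn + fp)  # benign lesions correctly cleared
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=85, fn=15, tn=180, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.85, 0.90
```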
Reflectance confocal microscopy terminology glossary for nonmelanocytic skin lesions: A systematic review
Background: There is a lack of uniformity in reflectance confocal microscopy (RCM) terminology for nonmelanocytic lesions (NMLs). Objective: To review published RCM terms for NMLs and identify likely synonymous terms. Methods: We conducted a systematic review of original research articles published up to August 19, 2017, adhering to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Two investigators gathered all published RCM terms used to describe basal cell carcinoma (BCC), squamous cell carcinoma (SCC), and seborrheic keratosis/solar lentigo/lichen planus–like keratosis (SK/SL/LPLK). Synonymous terms were grouped on the basis of similarity in definition and histopathologic correlates. Results: The inclusion criteria were met by 31 studies. Average frequency of use per term was 1.6 (range 1-8). By grouping synonymous terms, the number of terms could be reduced from 58 to 18 for BCC, 58 to 36 for SCC, 23 to 12 for SK/SL/LPLK, and from 139 to 66 terms (52.5% reduction) in total. The frequency of term usage stratified by anatomic layer (suprabasal epidermis vs epidermal basal layer, dermoepidermal junction, and superficial dermis) was 27 (25.7%) versus 78 (74.2%) for BCC, 60 (64.5%) versus 33 (34.5%) for SCC, and 15 (45.4%) versus 18 (54.5%) for SK/SL/LPLK, respectively. Limitations: Articles that were not peer reviewed were excluded. Conclusion: This systematic review of published RCM terms provides the basis for a future NML terminology consensus.
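The term-reduction arithmetic reported above (139 to 66 terms, a 52.5% reduction) amounts to mapping published terms onto canonical groups and counting. A toy sketch, with an invented synonym map rather than the paper's actual groupings:

```python
# Group synonymous published terms under canonical names and measure the
# vocabulary reduction. The mapping below is a made-up toy example.
synonym_map = {
    "dark silhouettes": "tumor islands",
    "tumor nests": "tumor islands",
    "bright tumor islands": "tumor islands",
    "polarized nuclei": "elongated nuclei",
    "streaming nuclei": "elongated nuclei",
}

published = set(synonym_map)           # 5 published terms
canonical = set(synonym_map.values())  # 2 grouped terms
reduction = 1 - len(canonical) / len(published)
print(f"{len(published)} -> {len(canonical)} terms ({reduction:.1%} reduction)")
```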
Validation of artificial intelligence prediction models for skin cancer diagnosis using dermoscopy images: the 2019 International Skin Imaging Collaboration Grand Challenge
Background: Previous studies of artificial intelligence (AI) applied to dermatology have shown AI to have higher diagnostic classification accuracy than expert dermatologists; however, these studies did not adequately assess clinically realistic scenarios, such as how AI systems behave when presented with images of disease categories that are not included in the training dataset or images drawn from statistical distributions with significant shifts from training distributions. We aimed to simulate these real-world scenarios and evaluate the effects of image source institution, diagnoses outside of the training set, and other image artifacts on classification accuracy, with the goal of informing clinicians and regulatory agencies about safety and real-world accuracy. Methods: We designed a large dermoscopic image classification challenge to quantify the performance of machine learning algorithms for the task of skin cancer classification from dermoscopic images, and how this performance is affected by shifts in statistical distributions of data, disease categories not represented in training datasets, and imaging or lesion artifacts. Factors that might be beneficial to performance, such as clinical metadata and external training data collected by challenge participants, were also evaluated. 25 331 training images collected from two datasets (in Vienna [HAM10000] and Barcelona [BCN20000]) between Jan 1, 2000, and Dec 31, 2018, across eight skin diseases, were provided to challenge participants to design appropriate algorithms. The trained algorithms were then tested for balanced accuracy against the HAM10000 and BCN20000 test datasets and data from countries not included in the training dataset (Turkey, New Zealand, Sweden, and Argentina). Test datasets contained images of all diagnostic categories available in training plus other diagnoses not included in training data (not trained category). We compared the performance of the algorithms against that of 18 dermatologists in a simulated setting that reflected intended clinical use. Findings: 64 teams submitted 129 state-of-the-art algorithm predictions on a test set of 8238 images. The best performing algorithm achieved 58·8% balanced accuracy on the BCN20000 data, which was designed to better reflect realistic clinical scenarios, compared with 82·0% balanced accuracy on HAM10000, which was used in a previously published benchmark. Shifted statistical distributions and disease categories not included in training data contributed to decreases in accuracy. Image artifacts, including hair, pen markings, ulceration, and imaging source institution, decreased accuracy in a complex manner that varied based on the underlying diagnosis. When comparing algorithms to expert dermatologists (2460 ratings on 1269 images), algorithms performed better than experts in most categories, except for actinic keratoses (similar accuracy on average) and images from categories not included in training data (26% correct for experts vs 6% correct for algorithms, p<0·0001). For the top 25 submitted algorithms, 47·1% of the images from categories not included in training data were misclassified as malignant diagnoses, which would lead to a substantial number of unnecessary biopsies if current state-of-the-art AI technologies were clinically deployed. 
Interpretation: We have identified specific deficiencies and safety issues in AI diagnostic systems for skin cancer that should be addressed in future diagnostic evaluation protocols to improve safety and reliability in clinical practice. Funding: Melanoma Research Alliance and La Marató de TV3.
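The challenge's headline metric, balanced accuracy, is the mean of per-class recall; a diagnostic category absent from training simply contributes a zero-recall term for any algorithm that cannot predict it. A minimal sketch using scikit-learn, with invented labels rather than challenge data:

```python
# Balanced accuracy (mean per-class recall) over invented diagnosis labels.
# "not_trained" stands in for a category unseen during training: the model
# can never predict it, so that class contributes 0 recall to the average.
from sklearn.metrics import balanced_accuracy_score

y_true = ["mel", "bcc", "nv", "mel", "not_trained", "nv"]
y_pred = ["mel", "bcc", "nv", "nv",  "mel",         "nv"]

print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")
# (0.5 + 1.0 + 1.0 + 0.0) / 4 = 0.625
```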