Anderson, Melissa L.

Bassett, Lawrence

Bogart, Andy

Buist, Diana S.M.

Carney, Patricia A.

Geller, Berta

Kerlikowske, Karla

Miglioretti, Diana L.

Monsees, Barbara

Onega, Tracy

Sickles, Edward A.

Smith, Robert A.

Yankaskas, Bonnie C.

PubMed

Rationale and objectives: Test sets for assessing and improving radiologic image interpretation have been used for decades and typically evaluate performance relative to gold standard interpretations by experts. To assess test sets for screening mammography, a gold standard for whether a woman should be recalled for additional workup is needed, given that interval cancers may be occult on mammography and some findings ultimately determined to be benign require additional imaging to determine if biopsy is warranted. Using experts to set a gold standard assumes little variation occurs in their interpretations, but this has not been explicitly studied in mammography.

Materials and methods: Using digitized films from 314 screening mammography exams (n = 143 cancer cases) performed in the Breast Cancer Surveillance Consortium, we evaluated interpretive agreement among three expert radiologists who independently assessed whether each examination should be recalled, and the lesion location, finding type (mass, calcification, asymmetric density, or architectural distortion), and interpretive difficulty in the recalled images.

Results: Agreement among the three expert pairs for recall/no recall was higher for cancer cases (mean 74.3 ± 6.5) than for noncancers (mean 62.6 ± 7.1). Complete agreement on recall, lesion location, finding type, and difficulty ranged from 36.4% to 42.0% for cancer cases and from 43.9% to 65.6% for noncancer cases. Two of three experts agreed on recall and lesion location for 95.1% of cancer cases and 91.8% of noncancer cases, but all three experts agreed on only 55.2% of cancer cases and 42.1% of noncancer cases.

Conclusion: Variability in expert interpretation is notable. A minimum of three independent experts combined with a consensus should be used for establishing any gold standard interpretation for test sets, especially for noncancer cases.
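The agreement measures reported in the Results can be illustrated with a minimal sketch. It is not the authors' analysis code. The reader names, the label strings, and the four toy exams are hypothetical, and the ± figure is assumed here to be a standard deviation across the three reader pairs, which the abstract does not state.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev


def pairwise_agreement(ratings):
    """Percent agreement for every pair of readers.

    ratings maps a reader name to a list of labels, one per exam,
    with all readers listing the exams in the same order.
    """
    result = {}
    for a, b in combinations(sorted(ratings), 2):
        matches = sum(x == y for x, y in zip(ratings[a], ratings[b]))
        result[(a, b)] = 100.0 * matches / len(ratings[a])
    return result


def agreement_levels(ratings):
    """Percent of exams where at least two readers agree, and where all agree."""
    readers = sorted(ratings)
    n_exams = len(ratings[readers[0]])
    majority = unanimous = 0
    for i in range(n_exams):
        # Size of the largest group of readers giving the same label.
        top = Counter(ratings[r][i] for r in readers).most_common(1)[0][1]
        majority += top >= 2
        unanimous += top == len(readers)
    return 100.0 * majority / n_exams, 100.0 * unanimous / n_exams


# Hypothetical toy data: each label combines the recall decision and location.
toy = {
    "A": ["none", "L-mass", "R-calc", "none"],
    "B": ["none", "L-mass", "none", "L-mass"],
    "C": ["none", "L-mass", "R-calc", "R-calc"],
}

pairs = pairwise_agreement(toy)
pair_values = list(pairs.values())
print(pairs)
print(round(mean(pair_values), 1), round(stdev(pair_values), 1))
print(agreement_levels(toy))
```

With labels that encode both recall and location, at least two of three readers can agree on most exams while all three agree on far fewer, which is the pattern the abstract reports.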

eScholarship - University of California

English

Establishing a Gold Standard for Test Sets: Variation in Interpretive Agreement of Expert Mammographers

Carolina Digital Repository

http://dx.doi.org/10.1016/j.acra.2013.01.012
