Natural language processing (NLP) shows promise as a means to automate the
labelling of hospital-scale neuroradiology magnetic resonance imaging (MRI)
datasets for computer vision applications. To date, however, there has been no
thorough investigation into the validity of this approach, including the
accuracy of report labels compared to image labels and the performance of
non-specialist labellers. In this work, we draw on
the experience of a team of neuroradiologists who labelled over 5000 MRI
neuroradiology reports as part of a project to build a dedicated deep
learning-based neuroradiology report classifier. We show that, in our
experience, assigning binary labels (i.e. normal vs abnormal) to images from
reports alone is highly accurate. In contrast to the binary labels, however,
the accuracy of more granular labelling is dependent on the category, and we
highlight reasons for this discrepancy. We also show that downstream model
performance is reduced when labelling of training reports is performed by a
non-specialist. To help other researchers accelerate their work, we make
available our refined abnormality definitions and labelling rules, as well as
our easy-to-use radiology report labelling app, which helps streamline this
process.