SemClinBr -- a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks
The high volume of research on extracting patient information from
electronic health records (EHR) has increased the demand for
annotated corpora, which are a valuable resource for both the development
and evaluation of natural language processing (NLP) algorithms. The absence of
a multi-purpose clinical corpus in languages other than English,
especially Brazilian Portuguese, is glaring and severely hampers scientific
progress in the biomedical NLP field. In this study, we developed a
semantically annotated corpus using clinical texts from multiple medical
specialties, document types, and institutions. We present the following: (1) a
survey listing common aspects and lessons learned from previous research, (2) a
fine-grained annotation schema that can be replicated and used to guide other
annotation initiatives, (3) a web-based annotation tool focusing on an
annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation
of the annotations. The result of this work is SemClinBr, a corpus of
1,000 clinical notes labeled with 65,117 entities and 11,263 relations, which
can support a variety of clinical NLP tasks and boost the secondary use of
EHRs for the Portuguese language.