Towards comprehensive syntactic and semantic annotations of the clinical narrative
Objective: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP), and to develop NLP algorithms and open-source components. Methods: Manual annotation of a clinical narrative corpus of 127 606 tokens following the Treebank schema for syntactic information, the PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed from the corpus. Results: The final corpus consists of 13 091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28 539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). A part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler were built from the corpus and released as open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. Conclusions: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would previously have been impossible.
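As an illustration of how agreement figures like those above can be computed, here is a minimal sketch using scikit-learn's cohen_kappa_score over two annotators' token-aligned NE labels. The labels, the alignment, and the choice of Cohen's kappa are assumptions for illustration; the paper does not state that this exact metric produced the numbers above.

```python
# Minimal sketch: chance-corrected inter-annotator agreement over
# token-aligned named-entity labels from two annotators. The label set
# below is illustrative, not the project's full UMLS semantic groups.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Disorders", "O", "Procedures", "Anatomy", "O", "Disorders"]
annotator_b = ["Disorders", "O", "Procedures", "O", "O", "Disorders"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.3f}")
```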
To Annotate More Accurately or to Annotate More
This paper demonstrates that the greatest value for annotation lies in single-annotating more data rather than multiply-annotating a smaller corpus.
Exploring Text Representations for Generative Temporal Relation Extraction
Sequence-to-sequence models are appealing because they allow both the encoder and decoder to be shared across many tasks by formulating those tasks as text-to-text problems. Despite recently reported successes of such models, we find that engineering input/output representations for such text-to-text models is challenging. On the Clinical TempEval 2016 relation extraction task, the most natural choice of output representation, where relations are spelled out in simple predicate logic statements, did not lead to good performance. We explore a variety of input/output representations, with the most successful one prompting one event at a time, achieving results competitive with standard pairwise temporal relation extraction systems. © 2022 Association for Computational Linguistics.
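To make the "one event at a time" idea concrete, the sketch below shows one way such text-to-text input/output strings might be constructed. The <e> marker tokens, the prompt wording, and the AFTER relation label are illustrative assumptions, not the representations actually used in the paper.

```python
# Hedged sketch: format a temporal relation extraction instance as a
# text-to-text problem, prompting for one highlighted event at a time.
def build_source(sentence: str, event: str) -> str:
    # Mark the first occurrence of the event so the model knows its focus.
    marked = sentence.replace(event, f"<e> {event} </e>", 1)
    return f"temporal relations for <e> {event} </e>: {marked}"

source = build_source(
    "The patient developed a fever after the transfusion.", "fever"
)
# A seq2seq model (e.g., a T5 variant) would be trained to generate a
# target string naming the related event and the relation, such as:
target = "fever AFTER transfusion"
print(source)
print(target)
```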
EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain
Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus, along with a novel entity-centric masking strategy to infuse domain knowledge into the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection (Wu et al., 2014), document time relation (DocTimeRel) classification (Lin et al., 2020b), and temporal relation extraction (Wright-Bettner et al., 2020). We also evaluate our models on the PubMedQA (Jin et al., 2019) dataset to measure the models' performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English. © 2021 Association for Computational Linguistics.
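A minimal sketch of what an entity-centric masking strategy can look like, assuming entity spans are available from an upstream tagger: positions inside entity spans are masked preferentially, with random non-entity positions filling any remainder of the masking budget. The sampling details below are assumptions; the paper's exact procedure may differ.

```python
# Hedged sketch: prefer masking tokens inside entity spans over random
# tokens when building masked language modeling inputs.
import random

def entity_centric_mask(tokens, entity_spans, mask_rate=0.15, mask_token="[MASK]"):
    # Positions inside entity spans are masked first; random non-entity
    # positions fill any remainder of the masking budget.
    entity_pos = [i for start, end in entity_spans for i in range(start, end)]
    other_pos = [i for i in range(len(tokens)) if i not in set(entity_pos)]
    n_mask = max(1, int(len(tokens) * mask_rate))
    random.shuffle(entity_pos)
    chosen = set(entity_pos[:n_mask])
    if len(chosen) < n_mask:
        chosen |= set(random.sample(other_pos, n_mask - len(chosen)))
    return [mask_token if i in chosen else t for i, t in enumerate(tokens)]

tokens = "patient denies chest pain on exertion".split()
print(entity_centric_mask(tokens, entity_spans=[(2, 4)]))  # "chest pain"
```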
Publicly available machine learning models for identifying opioid misuse from the clinical notes of hospitalized patients
Background: Automated de-identification methods for removing protected health information (PHI) from the source notes of the electronic health record (EHR) rely on building systems to recognize mentions of PHI in text, but they remain inadequate at ensuring perfect PHI removal. As an alternative to relying on de-identification systems, we propose the following solutions: (1) mapping the corpus of documents to a standardized medical vocabulary (concept unique identifier [CUI] codes mapped from the Unified Medical Language System), thus eliminating PHI as inputs to a machine learning model; and (2) training character-based machine learning models that obviate the need for a dictionary containing input words/n-grams. We aim to test the performance of models with and without PHI in a use case for an opioid misuse classifier. Methods: An observational cohort sampled from adult hospital inpatient encounters at a health system between 2007 and 2017. A case-control stratified sampling (n = 1000) was performed to build an annotated dataset for a reference standard of cases and non-cases of opioid misuse. Models for training and testing included CUI code, character-based, and n-gram features. The models applied were machine learning models with neural networks and logistic regression, as well as an expert-consensus rule-based model for opioid misuse. The areas under the receiver operating characteristic curve (AUROC) were compared between models for discrimination. The Hosmer-Lemeshow test and visual plots measured model fit and calibration. Results: Machine learning models with CUI codes performed similarly to n-gram models with PHI. The top performing models, with AUROCs > 0.90, included CUI codes as inputs to a convolutional neural network, a max pooling network, and a logistic regression model. The best-calibrated models with the best model fit were the CUI-based convolutional neural network and max pooling network. The top-weighted CUI codes in the logistic regression included the related terms 'Heroin' and 'Victim of abuse'. Conclusions: We demonstrate good test characteristics for an opioid misuse computable phenotype that is void of any PHI and performs similarly to models that use PHI. Herein we share a PHI-free, trained opioid misuse classifier for other researchers and health systems to use and benchmark, to overcome privacy and security concerns.
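To make the PHI-free idea concrete, here is a hedged sketch that treats each note as a bag of CUI codes and fits a logistic regression over them. The CUI strings, labels, and pipeline are illustrative, and the upstream text-to-CUI mapping (e.g., with a UMLS annotator such as cTAKES) is assumed to happen before this step.

```python
# Hedged sketch: a PHI-free classifier whose inputs are UMLS concept
# unique identifiers (CUIs) rather than raw note text. The CUI strings
# and labels below are fabricated for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each "document" is a whitespace-separated sequence of CUI codes.
docs = [
    "C0011892 C0027796 C0030193",
    "C0019151 C0001969 C0085762",
    "C0030193 C0011892",
    "C0001969 C0085762 C0019151",
]
labels = [0, 1, 0, 1]  # 1 = opioid misuse case (toy labels)

model = make_pipeline(
    CountVectorizer(lowercase=False, token_pattern=r"C\d{7}"),
    LogisticRegression(),
)
model.fit(docs, labels)
print(model.predict(["C0085762 C0019151"]))
```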
Natural language processing and machine learning to identify alcohol misuse from the electronic health record in trauma patients: development and internal validation
Objective: Alcohol misuse is present in over a quarter of trauma patients. Information in the clinical notes of the electronic health record of trauma patients may be used for phenotyping tasks with natural language processing (NLP) and supervised machine learning. The objective of this study is to train and validate an NLP classifier for identifying patients with alcohol misuse. Materials and Methods: An observational cohort of 1422 adult patients admitted to a trauma center between April 2013 and November 2016. Linguistic processing of clinical notes was performed using the clinical Text Analysis and Knowledge Extraction System (cTAKES). The primary analysis was the binary classification of alcohol misuse. The Alcohol Use Disorders Identification Test served as the reference standard. Results: The data corpus comprised 91 045 electronic health record notes and 16 091 features. In the final machine learning classifier, 16 features were selected from the first 24 hours of notes for identifying alcohol misuse. The classifier's performance in the validation cohort had an area under the receiver-operating characteristic curve of 0.78 (95% confidence interval [CI], 0.72 to 0.85). Sensitivity and specificity were 56.0% (95% CI, 44.1% to 68.0%) and 88.9% (95% CI, 84.4% to 92.8%), respectively. The Hosmer-Lemeshow goodness-of-fit test demonstrated that the classifier fit the data well (P = .17). A simpler rule-based keyword approach had markedly lower sensitivity than the NLP classifier (18.2% vs 56.0%). Conclusions: The NLP classifier has adequate predictive validity for identifying alcohol misuse in trauma centers. External validation is needed before its application to augment screening.
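For readers who want to reproduce the calibration check reported here and in the previous abstract, the following is a minimal sketch of the Hosmer-Lemeshow goodness-of-fit test implemented from its textbook definition (decile-of-risk bins, chi-squared statistic with g − 2 degrees of freedom); the data below are synthetic, not the study's.

```python
# Minimal sketch: Hosmer-Lemeshow goodness-of-fit test. Bin predictions
# into deciles of predicted risk, then compare observed vs expected
# event counts with a chi-squared statistic.
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_bins=10):
    order = np.argsort(y_prob)
    bins = np.array_split(order, n_bins)  # deciles of predicted risk
    stat = 0.0
    for idx in bins:
        obs = y_true[idx].sum()          # observed events in the bin
        exp = y_prob[idx].sum()          # expected events in the bin
        mean_p = exp / len(idx)
        stat += (obs - exp) ** 2 / (len(idx) * mean_p * (1 - mean_p))
    return stat, chi2.sf(stat, df=n_bins - 2)  # statistic and p-value

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 200)
y = (rng.uniform(size=200) < p).astype(int)  # well-calibrated toy outcomes
print(hosmer_lemeshow(y, p))
```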