Search CORE

145 research outputs found

Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease

Author: A Roberts
Adi V Gundlapalli
AV Gundlapalli
Brett R South
CR Weir
EM Fielstein
G Hripcsak
G Hripcsak
HJ Tange
Jennifer Garvin
JF Penz
JJ Cimino
MA Musen
Makoto Jones
Matthew H Samore
PV Ogren
RH Dolin
S Brown
SH Brown
SH Brown
Shuying Shen
SM Meystre
V Kashyap
W Chapman
Wendy W Chapman
WW Chapman
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background Natural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) to focus and clarify NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers. Methods Using a set of clinical documents from the VA EMR for a particular use case of interest we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types. Results Annotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield. Conclusion Challenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Melbourne Institutional Repository

Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows

Author
Publication venue: BioMed Central
Publication date
Field of study

Springer - Publisher Connector

Annotating a corpus of clinical text records for learning to recognize symptoms automatically

Author: Carroll John
Koeling Rob
Nicholson Amanda
Tate Rosemary
Publication venue: 'Norwegian University of Science and Technology (NTNU) Library'
Publication date: 17/08/2011
Field of study

We report on a research effort to create a corpus of clinical free text records enriched with annotation for symptoms of a particular disease (ovarian cancer). We describe the original data, the annotation procedure and the resulting corpus. The data (approximately 192K words) was annotated by three clinicians and a procedure was devised to resolve disagreements. We are using the corpus to investigate the amount of symptom-related information in clinical records that is not coded, and to develop techniques for recognizing these symptoms automatically in unseen text

Sussex Research Online

BMC Bioinformatics

Author
Publication venue
Publication date
Field of study

BackgroundNatural Language Processing (NLP) systems can be used for specific Information Extraction (IE) tasks such as extracting phenotypic data from the electronic medical record (EMR). These data are useful for translational research and are often found only in free text clinical notes. A key required step for IE is the manual annotation of clinical corpora and the creation of a reference standard for (1) training and validation tasks and (2) to focus and clarify NLP system requirements. These tasks are time consuming, expensive, and require considerable effort on the part of human reviewers.MethodsUsing a set of clinical documents from the VA EMR for a particular use case of interest we identify specific challenges and present several opportunities for annotation tasks. We demonstrate specific methods using an open source annotation tool, a customized annotation schema, and a corpus of clinical documents for patients known to have a diagnosis of Inflammatory Bowel Disease (IBD). We report clinician annotator agreement at the document, concept, and concept attribute level. We estimate concept yield in terms of annotated concepts within specific note sections and document types.ResultsAnnotator agreement at the document level for documents that contained concepts of interest for IBD using estimated Kappa statistic (95% CI) was very high at 0.87 (0.82, 0.93). At the concept level, F-measure ranged from 0.61 to 0.83. However, agreement varied greatly at the specific concept attribute level. For this particular use case (IBD), clinical documents producing the highest concept yield per document included GI clinic notes and primary care notes. Within the various types of notes, the highest concept yield was in sections representing patient assessment and history of presenting illness. Ancillary service documents and family history and plan note sections produced the lowest concept yield.ConclusionChallenges include defining and building appropriate annotation schemas, adequately training clinician annotators, and determining the appropriate level of information to be annotated. Opportunities include narrowing the focus of information extraction to use case specific note types and sections, especially in cases where NLP systems will be used to extract information from large repositories of electronic clinical note documents.1 PO1 CD000284-01/CD/ODCDC CDC HHS/United States19761566PMC274568

CDC Stacks

A Systematic Review of Natural Language Processing for Knowledge Management in Healthcare

Author: Basyal Ganga Prasad
Rimal Bhaskar P.
Zeng David
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 17/07/2020
Field of study

Driven by the visions of Data Science, recent years have seen a paradigm shift in Natural Language Processing (NLP). NLP has set the milestone in text processing and proved to be the preferred choice for researchers in the healthcare domain. The objective of this paper is to identify the potential of NLP, especially, how NLP is used to support the knowledge management process in the healthcare domain, making data a critical and trusted component in improving the health outcomes. This paper provides a comprehensive survey of the state-of-the-art NLP research with a particular focus on how knowledge is created, captured, shared, and applied in the healthcare domain. Our findings suggest, first, the techniques of NLP those supporting knowledge management extraction and knowledge capture processes in healthcare. Second, we propose a conceptual model for the knowledge extraction process through NLP. Finally, we discuss a set of issues, challenges, and proposed future research areas

arXiv.org e-Print Archive

Crossref

A Systematic Review of Natural Language Processing for Knowledge Management in Healthcare

Author: Basyal Ganga Prasad
Rimal Bhaskar P., Dr.
Zeng David
Publication venue: Beadle Scholar
Publication date: 17/07/2020
Field of study

Driven by the visions of Data Science, recent years have seen a paradigm shift in Natural Language Processing (NLP). NLP has set the milestone in text processing and proved to be the preferred choice for researchers in the healthcare domain. The objective of this paper is to identify the potential of NLP, especially, how NLP is used to support the knowledge management process in the healthcare domain, making data a critical and trusted component in improving health outcomes. This paper provides a comprehensive survey of the state-of-the-art NLP research with a particular focus on how knowledge is created, captured, shared, and applied in the healthcare domain. Our findings suggest, first, the techniques of NLP those supporting knowledge management extraction and knowledge capture processes in healthcare. Second, we propose a conceptual model for the knowledge extraction process through NLP. Finally, we discuss a set of issues, challenges, and proposed future research areas

Beadle Scholar at Dakota State University

Selected proceedings of the 2009 Summit on Translational Bioinformatics

Author: A Grossman
A Janevski
AK Dean
B Jin
BR South
C Overby
E Lee
E Pitzer
EA Zerhouni
H Chang
Indra Neil Sarkar
JL Lustgarten
NH Shah
P Shivakumar
R Lacson
SK Bhavnani
TH Shen
X Wang
X Yang
Yves A Lussier
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Linking Clinical Records to the Biomedical Literature

Author: Alnazzawi Noha
Publication venue
Publication date: 31/12/2016
Field of study

The University of Manchester - Institutional Repository

PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature

Author: Binkheder Samar
Chiang Chien-Wei
Jones Josette
Li Lang
Quinney Sara K.
Wang Lei
Wu Heng-Yi
Zhang Shijun
Zitu Md. Muntasir
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022
Field of study

Background Adverse events induced by drug-drug interactions are a major concern in the United States. Current research is moving toward using electronic health record (EHR) data, including for adverse drug events discovery. One of the first steps in EHR-based studies is to define a phenotype for establishing a cohort of patients. However, phenotype definitions are not readily available for all phenotypes. One of the first steps of developing automated text mining tools is building a corpus. Therefore, this study aimed to develop annotation guidelines and a gold standard corpus to facilitate building future automated approaches for mining phenotype definitions contained in the literature. Furthermore, our aim is to improve the understanding of how these published phenotype definitions are presented in the literature and how we annotate them for future text mining tasks. Results Two annotators manually annotated the corpus on a sentence-level for the presence of evidence for phenotype definitions. Three major categories (inclusion, intermediate, and exclusion) with a total of ten dimensions were proposed characterizing major contextual patterns and cues for presenting phenotype definitions in published literature. The developed annotation guidelines were used to annotate the corpus that contained 3971 sentences: 1923 out of 3971 (48.4%) for the inclusion category, 1851 out of 3971 (46.6%) for the intermediate category, and 2273 out of 3971 (57.2%) for exclusion category. The highest number of annotated sentences was 1449 out of 3971 (36.5%) for the “Biomedical & Procedure” dimension. The lowest number of annotated sentences was 49 out of 3971 (1.2%) for “The use of NLP”. The overall percent inter-annotator agreement was 97.8%. Percent and Kappa statistics also showed high inter-annotator agreement across all dimensions. Conclusions The corpus and annotation guidelines can serve as a foundational informatics approach for annotating and mining phenotype definitions in literature, and can be used later for text mining applications

IUPUIScholarWorks

PubMed Central

Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

Author: Binkheder Samar Hussein
Publication venue
Publication date: 01/07/2019
Field of study

Indiana University-Purdue University Indianapolis (IUPUI)Phenotyping definitions are essential in cohort identification when conducting clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health records’ data that suffer from bias, confounding, and incompleteness. Limited efforts established in utilizing text-mining and data-driven approaches to automate extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we proposed a text-mining pipeline combining rule-based and machine-learning methods to automate retrieval, classification, and extraction of phenotyping definitions’ information from literature. To achieve this, we first developed an annotation guideline with ten dimensions to annotate sentences with evidence of phenotyping definitions' modalities, such as phenotypes and laboratories. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text observational studies’ methods sections (n=86). Percent and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level. We applied the abstract-level classifier on a large-scale biomedical literature of over 20 million abstracts published between 1975 and 2018 to classify positive abstracts (n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from their methods sections and used the full-text sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the positively classified sentences. Lexica-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were used to identify and rank phenotype candidates that are associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. Our literature-based associations and large-scale corpus contribute in building new data-driven phenotyping definitions and expanding existing definitions with minimal expert involvement

IUPUIScholarWorks