J Biomed Inform
We followed a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses to identify existing clinical natural language processing (NLP) systems that generate structured information from unstructured free text. Seven literature databases were searched with a query combining the concepts of natural language processing and structured data capture. Two reviewers screened all records for relevance during two screening phases, and information about clinical NLP systems was collected from the final set of papers. A total of 7149 records (after removing duplicates) were retrieved and screened, and 86 were determined to fit the review criteria. These papers contained information about 71 different clinical NLP systems, which were then analyzed. The NLP systems address a wide variety of important clinical and research tasks. Certain tasks are well addressed by the existing systems, while others remain as open challenges that only a small number of systems attempt, such as extraction of temporal information or normalization of concepts to standard terminologies. This review has identified many NLP systems capable of processing clinical free text and generating structured output, and the information collected and evaluated here will be important for prioritizing development of new approaches for clinical NLP.
A modular, open-source information extraction framework for identifying clinical concepts and processes of care in clinical narratives
In this thesis, a synthesis is presented of the knowledge models required by clinical information systems that provide decision support for longitudinal processes of care. Qualitative research techniques and thematic analysis are applied in a novel way to a systematic review of the literature on the challenges in implementing such systems, leading to the development of an original conceptual framework. The thesis demonstrates how these process-oriented systems make use of a knowledge base derived from workflow models and clinical guidelines, and argues that one of the major barriers to implementation is the need to extract explicit and implicit information from diverse resources in order to construct the knowledge base. Moreover, concepts in both the knowledge base and in the electronic health record (EHR) must be mapped to a common ontological model. However, the majority of clinical guideline information remains in text form, and much of the useful clinical information in the EHR resides in the free text fields of progress notes and laboratory reports. In this thesis, it is shown how natural language processing and information extraction techniques provide a means to identify and formalise the knowledge components required by the knowledge base. Original contributions are made in the development of lexico-syntactic patterns and the use of external domain knowledge resources to tackle a variety of information extraction tasks in the clinical domain, such as recognition of clinical concepts, events, temporal relations, term disambiguation and abbreviation expansion. Methods are developed for adapting existing tools and resources in the biomedical domain to the processing of clinical texts, and approaches to improving the scalability of these tools are proposed and evaluated. These tools and techniques are then combined in the creation of a novel approach to identifying processes of care in the clinical narrative.
It is demonstrated that resolution of coreferential and anaphoric relations as narratively and temporally ordered chains provides a means to extract linked narrative events and processes of care from clinical notes. Coreference performance in discharge summaries and progress notes is largely dependent on correct identification of protagonist chains (patient, clinician, family relation), pronominal resolution, and string matching that takes account of experiencer, temporal, spatial, and anatomical context; whereas for laboratory reports additional, external domain knowledge is required. The types of external knowledge and their effects on system performance are identified and evaluated. Results are compared against existing systems for solving these tasks and are found to improve on them, or to approach the performance of recently reported, state-of-the-art systems. Software artefacts developed in this research have been made available as open-source components within the General Architecture for Text Engineering framework.
Ontology-based Semantic Harmonization of HIV-associated Common Data Elements for Integration of Diverse HIV Research Datasets
Analysis of integrated, diverse, Human Immunodeficiency Virus (HIV)-associated datasets can increase knowledge and guide the development of novel and effective interventions for disease prevention and treatment by increasing breadth of variables and statistical power, particularly for sub-group analyses. This topic has been identified as a National Institutes of Health research priority, but few efforts have been made to integrate data across HIV studies. Our aims were to: 1) Characterize the semantic heterogeneity (SH) in the HIV research domain; 2) Identify HIV-associated common data elements (CDEs) in empirically generated and knowledge-based resources; 3) Create a formal representation of HIV-associated CDEs in the form of an HIV-associated Entities in Research Ontology (HERO); 4) Assess the feasibility of using HERO to semantically harmonize HIV research data. Our approach was guided by information/knowledge theory and the DIKW (Data Information Knowledge Wisdom) hierarchical model.
Our systematized review of the literature revealed that ontologies and CDEs were used synergistically for integration, interoperability, data exchange, and data standardization. Methods and tools included the use of experts for CDE identification, the Unified Medical Language System, natural language processing, Extensible Markup Language, Health Level 7, and ontology development tools (e.g., Protégé). Evaluation methods included expert assessment, quantification of mapping tasks between raters, assessment of interrater reliability, and comparison to established standards. We used these findings to inform our process for achieving the study aims.
For Aim 1, we analyzed eight disparate HIV-associated data dictionaries and developed a String Metric-assisted Assessment of Semantic Heterogeneity (SMASH) method, which aided identification of 127 (13%) homogeneous data element (DE) pairs and 1,048 (87%) semantically heterogeneous DE pairs. Most heterogeneous pairs (97%) were semantically-equivalent/syntactically-different, allowing us to determine that SH in the HIV research domain was high.
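The abstract does not describe how SMASH scores a pair of data elements. Purely as an illustration of string-metric-assisted comparison, the sketch below uses the `SequenceMatcher` ratio from Python's standard library as a stand-in metric to separate lexically matching names from pairs that would need manual semantic review. The function names, the 0.8 threshold, and the sample data element names are all hypothetical and are not taken from the thesis.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two data element names."""
    # Light normalization (an assumption): ignore case and underscores.
    norm_a = a.lower().replace("_", " ").strip()
    norm_b = b.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def flag_pairs(dict_a, dict_b, threshold=0.8):
    """Split cross-dictionary name pairs into lexical matches and pairs for review."""
    likely_same, needs_review = [], []
    for a, b in product(dict_a, dict_b):
        score = name_similarity(a, b)
        target = likely_same if score >= threshold else needs_review
        target.append((a, b, score))
    return likely_same, needs_review


# Hypothetical data element names from two data dictionaries.
same, review = flag_pairs(
    ["birth_date", "hiv_status"],
    ["Birth Date", "date of birth"],
)
```

A string metric of this kind can only surface candidates: a pair such as `birth_date` and `date of birth` is semantically equivalent but syntactically different, so it still needs a human decision.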
To achieve Aim 2, we used ClinicalTrials.gov, Google Search, and text mining in R to identify HIV-associated CDEs in HIV journal articles, HIV-associated datasets, AIDSinfo HIV/AIDS Glossary, AIDSinfo Drug Database, Logical Observation Identifiers Names and Codes (LOINC), Systematized Nomenclature of Medicine (SNOMED), and RxNorm (understood as prescription normalization). Two HIV experts then manually reviewed DEs from the journal articles and data dictionaries to confirm DE commonality and resolved semantic discrepancies through discussion. Ultimately, we identified 2,179 unique CDEs. Of all CDEs, data-driven approaches identified 2,055 (94%) (999 from the HIV/AIDS Glossary, 398 from the Drug Database, 91 from journal articles, and a total of 567 from LOINC, SNOMED, and RxNorm cumulatively). Expert-based approaches identified 124 (6%) unique CDEs from data dictionaries and confirmed the 91 CDEs from journal articles.
In Aim 3, we used the Protégé suite of ontology development tools and the 2,179 CDEs to develop the HERO. We modeled the ontology using the semantic structure of the Medical Entities Dictionary, available hierarchical information from the CDE knowledge resources, and expert knowledge. The ontology fulfilled most relevant criteria from Cimino’s desiderata and OntoClean ontology engineering principles, and it successfully answered eight competency questions.
Finally, for Aim 4, we assessed the feasibility of using HERO to semantically harmonize and integrate the data dictionaries from two diverse HIV-associated datasets. Two HIV experts involved in the development of HERO independently assessed each data dictionary. Of the 367 DEs in data dictionary 1 (D1), 181 (49.32%) were identified as CDEs and 186 (50.68%) were not CDEs, and of the 72 DEs in data dictionary 2 (D2), 37 (51.39%) were CDEs and 35 (48.61%) were not CDEs. The HIV experts then traversed HERO’s hierarchy to map CDEs from D1 and D2 to CDEs in HERO. Of the 181 CDEs in D1, 156 (86.19%) were found in HERO, and 25 (13.81%) were not. Similarly, of the 37 CDEs in D2, 32 (86.49%) were found in HERO, and 5 (13.51%) were not. Interrater reliability for CDE identification as measured by Cohen’s Kappa was 0.900 for D1 and 0.892 for D2. Cohen’s Kappas for CDEs in D1 and D2 that were also identified in HERO were 0.885 and 0.688, respectively.
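For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. A minimal Python sketch follows; the function name and the toy ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example: two raters marking ten data elements as CDE / not CDE.
a = ["cde", "cde", "cde", "cde", "cde", "no", "no", "no", "no", "no"]
b = ["cde", "cde", "cde", "cde", "no", "no", "no", "no", "no", "cde"]
kappa = cohens_kappa(a, b)
```

In this toy case the raters agree on 8 of 10 items while chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.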
Subsequently, to demonstrate the integration of the two HIV-associated datasets, a sample of semantically harmonized CDEs in both datasets was categorically selected (e.g. administrative, demographic, and behavioral), and D2 sample size increases were calculated for race (e.g., White, African American/Black, Asian/Pacific Islander, Native American/Indian, and Hispanic/Latino) and for “intravenous drug use” from the integrated datasets. The average increase of D2 CDEs for six selected CDEs was 1,928%.
Despite the limitation of HERO developers also serving as evaluators, the contributions of the study to the fields of informatics and HIV research were substantial. Confirmatory contributions include identification of effective CDE/ontology tools and use of data-driven and expert-based methods. Novel contributions include development of SMASH and HERO. New contributions include documenting that SH is high in HIV-associated datasets, identifying 2,179 HIV-associated CDEs, creating two additional classifications of SH, and showing that using HERO for semantic harmonization of HIV-associated data dictionaries is feasible. Our future work will build upon this research by expanding the numbers and types of datasets, refining our methods and tools, and conducting an external evaluation.
Medical information extraction in European Portuguese (Extracção de informação médica em português europeu)
Doctorate in Informatics Engineering (Doutoramento em Engenharia Informática)
The electronic storage of medical patient data is becoming a daily experience
in most of the practices and hospitals worldwide. However, much of the data
available is in free-form text, a convenient way of expressing concepts and
events, but especially challenging if one wants to perform automatic searches,
summarization or statistical analysis. Information Extraction can relieve some of
these problems by offering a semantically informed interpretation and
abstraction of the texts.
MedInX, the Medical Information eXtraction system presented in this document,
is the first information extraction system developed to process textual clinical
discharge records written in Portuguese. The main goal of the system is to
improve access to the information locked up in unstructured text, and,
consequently, the efficiency of the health care process, by allowing faster and
more reliable access to quality health information for both patients and health
professionals.
MedInX components are based on Natural Language Processing principles,
and provide several mechanisms to read, process and utilize external
resources, such as terminologies and ontologies, in the process of automatic
mapping of free text reports onto a structured representation.
However, the flexible and scalable architecture of the system also allowed its
application to the task of Named Entity Recognition in a shared evaluation
contest focused on Portuguese general-domain free-form texts.
The evaluation of the system on a set of authentic hospital discharge letters
indicates that the system achieves a 95% F-measure on the task of entity
recognition and 95% precision on the task of relation extraction.
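As a reminder of how these figures relate, the F-measure is commonly the harmonic mean of precision and recall (F1, when both are weighted equally). The short Python sketch below computes all three from raw counts; the function name and the counts are hypothetical and are not MedInX results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct entities, 10 spurious, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because the harmonic mean is pulled toward the lower of its two inputs, a high F-measure requires both precision and recall to be high.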
Example applications, demonstrating the use of MedInX capabilities in real
applications in the hospital setting, are also presented in this document. These
applications were designed to answer common clinical problems related to
the automatic coding of diagnoses and other health-related conditions
described in the documents, according to the international classification
systems ICD-9-CM and ICF. The automatic review of the content and
completeness of the documents is another application that was developed,
named the MedInX Clinical Audit system.