23 research outputs found
Information extraction from medication leaflets
Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201
Clinical records anonymisation and text extraction (CRATE): an open-source software system.
BACKGROUND: Electronic medical records contain information of value for research, but contain identifiable and often highly sensitive confidential information. Patient-identifiable information cannot in general be shared outside clinical care teams without explicit consent, but anonymisation/de-identification allows research uses of clinical data without explicit consent. RESULTS: This article presents CRATE (Clinical Records Anonymisation and Text Extraction), an open-source software system with separable functions: (1) it anonymises or de-identifies arbitrary relational databases, with sensitivity and precision similar to previous comparable systems; (2) it uses public secure cryptographic methods to map patient identifiers to research identifiers (pseudonyms); (3) it connects relational databases to external tools for natural language processing; (4) it provides a web front end for research and administrative functions; and (5) it supports a specific model through which patients may consent to be contacted about research. CONCLUSIONS: Creation and management of a research database from sensitive clinical records with secure pseudonym generation, full-text indexing, and a consent-to-contact process is possible and practical using entirely free and open-source software. The project was funded in part by the UK National Institute of Health Research Cambridge Biomedical Research Centre. The work was conducted within the Behavioural and Clinical Neuroscience Institute, University of Cambridge, supported by the Wellcome Trust and the UK Medical Research Council.
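The pseudonym mapping in function (2) can be illustrated with a keyed hash. The sketch below is only an assumption about how such a mapping might look: the abstract does not say which cryptographic method CRATE uses, and the function name `pseudonymise`, the example key, and the example identifiers are hypothetical. It uses HMAC-SHA256 from the Python standard library, so the same patient identifier always yields the same research identifier, while the mapping cannot be reproduced without the secret key.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Map a patient identifier to a stable research identifier.

    Hypothetical helper: HMAC-SHA256 is an assumed choice, not
    necessarily the method used by CRATE.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Assumption: in practice the key would be a securely stored secret.
key = b"example-secret-key"

rid_a = pseudonymise("PATIENT-0001", key)
rid_b = pseudonymise("PATIENT-0001", key)  # same input, same pseudonym
rid_c = pseudonymise("PATIENT-0002", key)  # different input, different pseudonym
```

A keyed hash is preferable to a plain hash here because patient identifiers come from a small, guessable space; without the key, an attacker cannot rebuild the mapping by hashing candidate identifiers.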
Processamento automático de texto de narrativas clínicas
The informatization of medical systems and the subsequent move towards
the usage of Electronic Health Records (EHR) over the paper format by
medical professionals allowed for safer and more efficient healthcare. Additionally,
EHR can also be used as a data source for observational studies
around the world. However, it is estimated that 70-80% of all clinical data
is in the form of unstructured free text and regarding the data that is structured,
not all of it follows the same standards, making it difficult to use in
the mentioned observational studies.
This dissertation aims to tackle these two problems by using natural language
processing to extract concepts from free text and then
using a common data model to harmonize the data. The developed system
employs an annotator, namely cTAKES, to extract the concepts from free
text. The extracted concepts are then normalized using text preprocessing,
word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the
normalized concepts are converted to the OMOP Common Data Model and
stored in a database.
In order to test the developed system, the i2b2 2010 data set was used.
The different components of the system were tested and evaluated separately,
with the concept extraction component achieving a precision, recall
and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization
component was evaluated by completing the N2C2 2019 challenge
track 3, where it achieved a 77.5% accuracy. Finally, during the OMOP
CDM conversion component, it was observed that 7.92% of the concepts
were lost during the process. In conclusion, even though the developed system
still has margin for improvements, it proves to be a viable method of
automatically processing clinical narratives.
Mestrado em Engenharia de Computadores e Telemátic
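The reported concept-extraction scores can be checked against each other, assuming the F-score in the abstract is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state the variant explicitly. The helper name `f_score` below is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the concept extraction component (in %).
f1 = f_score(77.12, 70.29)
print(round(f1, 2))  # prints 73.55
```

The result agrees with the 73.55% stated in the abstract, which supports reading the reported F-score as F1.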
J Biomed Inform
We followed a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses to identify existing clinical natural language processing (NLP) systems that generate structured information from unstructured free text. Seven literature databases were searched with a query combining the concepts of natural language processing and structured data capture. Two reviewers screened all records for relevance during two screening phases, and information about clinical NLP systems was collected from the final set of papers. A total of 7149 records (after removing duplicates) were retrieved and screened, and 86 were determined to fit the review criteria. These papers contained information about 71 different clinical NLP systems, which were then analyzed. The NLP systems address a wide variety of important clinical and research tasks. Certain tasks are well addressed by the existing systems, while others remain as open challenges that only a small number of systems attempt, such as extraction of temporal information or normalization of concepts to standard terminologies. This review has identified many NLP systems capable of processing clinical free text and generating structured output, and the information collected and evaluated here will be important for prioritizing development of new approaches for clinical NLP.
CC999999/ImCDC/Intramural CDC HHS/United States; 2019-11-20T00:00:00Z; 28729030; PMC6864736; 694
Extracting clinical knowledge from electronic medical records
As the adoption of Electronic Medical Records (EMRs) rises in the healthcare
institutions, these resources’ importance increases due to all clinical information they
contain about patients. However, the unstructured information in the form of clinical
narratives present in these records makes it hard to extract and structure useful clinical
knowledge. This unstructured information limits the potential of the EMRs because the
clinical information these records contain can be used to perform essential tasks inside
healthcare institutions such as searching, summarization, decision support and statistical
analysis, as well as be used to support management decisions or serve for research. These
tasks can only be done if the unstructured clinical information from the narratives is
appropriately extracted, structured and processed into clinical knowledge. Usually, this
information extraction and structuring is performed manually by
healthcare practitioners, which is inefficient and error-prone. This research aims to
propose a solution to this problem, by using Machine Translation (MT) from the
Portuguese language to the English language, Natural Language Processing (NLP) and
Information Extraction (IE) techniques. With the help of these techniques, the goal is to
develop a prototype pipeline modular system that can extract clinical knowledge from
unstructured clinical information contained in Portuguese EMRs, in an automated way,
in order to help EMRs to fulfil their potential and consequently help the Portuguese
hospital involved in this research. This research also intends to show that this generic
prototype system and approach can potentially be applied to other hospitals, even if they
don’t use the Portuguese language.
Semantic annotation of medical documents in CDA context
The goal of this work is to recover semantic and structural information from medical documents in electronic format. Despite the progressive diffusion of Electronic Health Record systems, a lot of medical information is, partly for legacy reasons, available to patients and physicians only in image-only or textual format. The difficulty of obtaining such information when needed results in high costs for health providers. In this work we develop the concept of a system designed to convert legacy medical documents into a standard and interoperable format compliant with the Clinical Document Architecture model by means of semantic annotation.
Improving Syntactic Parsing of Clinical Text Using Domain Knowledge
Syntactic parsing is one of the fundamental tasks of Natural Language Processing (NLP). However, few studies have explored syntactic parsing in the medical domain. This dissertation systematically investigated different methods to improve the performance of syntactic parsing of clinical text, including (1) Constructing two clinical treebanks of discharge summaries and progress notes by developing annotation guidelines that handle missing elements in clinical sentences; (2) Retraining four state-of-the-art parsers, including the Stanford parser, Berkeley parser, Charniak parser, and Bikel parser, using clinical treebanks, and comparing their performance to identify better parsing approaches; and (3) Developing new methods to reduce syntactic ambiguity caused by Prepositional Phrase (PP) attachment and coordination using semantic information.
Our evaluation showed that clinical treebanks greatly improved the performance of existing parsers. The Berkeley parser achieved the best F-1 score of 86.39% on the MiPACQ treebank. For PP attachment, our proposed methods improved accuracy by 2.35% on the MiPACQ corpus and 1.77% on the i2b2 corpus. For coordination, our method achieved a precision of 94.9% and 90.3% on the MiPACQ and i2b2 corpora, respectively. To further demonstrate the effectiveness of the improved parsing approaches, we applied the outputs of our parsers to two external NLP tasks: semantic role labeling and temporal relation extraction. The experimental results showed that the performance of both tasks was improved by using the parse tree information from our optimized parsers, with an improvement of 3.26% in F-measure for semantic role labeling and an improvement of 1.5% in F-measure for temporal relation extraction.
Extracting knowledge from documents related with invasive fungal infections in iron overload context
Master's dissertation in Bioinformatics. Invasive fungal infections caused by Candida are associated with high mortality and morbidity
rates in hospitalized patients. Iron plays a major role in these infections, as they are exacerbated under
iron overload conditions. In this context, it is important to understand the association between iron
levels and invasive fungal infections, as it can serve as an indicator of the severity of the disease, and
eventually it can help establish measures to improve treatment efficacy.
Nowadays, manually inferring these associations from biomedical documents is a time-consuming task, due to the large amount of available scientific text data. As such, these tasks naturally
benefit from the Biomedical Text Mining field, which includes a wide variety of methods for automatic
extraction of high-quality information from biomedical text documents.
In this work, relevant documents related to iron overload and fungal infections were retrieved
from PubMed to build a corpus. Then, both Named Entity Recognition and Relation Extraction
processes were executed using the @Note text mining tool. Finally, relevant sentences were manually
extracted and a curated dataset with documents containing those sentences was created.
Since the number of publications obtained about Candida and iron overload was very low, the
analysis was made taking into account all fungi. A total of 15 publications were considered relevant and
168 relevant associations were extracted.
Although associations of iron levels with both severity of infection and treatment efficacy were not
extracted, it was possible to conclude that, in many cases, iron overload is a predictor for fungal
infections, and patients’ iron levels highly affect treatment efficacy.
The Biomedical Text Mining process described in the present thesis enabled the creation of a
dataset of relevant biomedical publications containing interesting associations between fungal
infections, drugs and associated diseases in a clinical context of iron overload, although in the future
this process could be improved, especially regarding dictionaries, in order to obtain a higher number of
relevant publications.
Methods and Techniques for Clinical Text Modeling and Analytics
Nowadays, a large portion of clinical data exists only as free text. The wide adoption of Electronic Health Records (EHRs) has increased access to clinical documents, which presents both challenges and opportunities for clinical Natural Language Processing (NLP) researchers. Given free-text clinical notes as input, an ideal system for clinical text understanding should be able to support clinical decisions. At corpus level, the system should recommend similar notes based on disease or patient types, and provide medication recommendations, or any other type of recommendation, based on patients' symptoms and other similar medical cases. At document level, it should return a list of important clinical concepts. Moreover, the system should be able to make diagnostic inferences over clinical concepts and output a diagnosis. Unfortunately, current work has not systematically studied such a system. This study focuses on developing and applying methods and techniques for different aspects of a system for clinical text understanding, at both corpus and document level. We deal with two major research questions: First, we explore the question of how to model the underlying relationships from clinical notes at corpus level. Document clustering methods can group clinical notes into meaningful clusters, which can help physicians and patients understand medical conditions and diseases from clinical notes. We use Nonnegative Matrix Factorization (NMF) and Multi-view NMF to cluster clinical notes based on extracted medical concepts. The clustering results display latent patterns existing among clinical notes. Our method provides a feasible way to visualize a corpus of clinical documents. Based on extracted concepts, we further build a symptom-medication (Symp-Med) graph to model the Symp-Med relations in a corpus of clinical notes. We develop two Symp-Med matching algorithms to predict and recommend medications for patients based on their symptoms.
Second, we address the question of how to integrate structured knowledge with unstructured text to improve results for clinical NLP tasks. On the one hand, unstructured clinical text contains a great deal of information about medical conditions. On the other hand, structured Knowledge Bases (KBs) are frequently used to support clinical NLP tasks. We propose graph-regularized word embedding models to integrate knowledge from both KBs and free text. We evaluate our models on standard datasets and biomedical NLP tasks, and the results showed encouraging improvements on both. We further apply the graph-regularized word embedding models and present a novel approach to automatically infer the most probable diagnosis from a given clinical narrative.
Ph.D., Information Studies -- Drexel University, 201
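The idea of a symptom-medication graph can be sketched with plain co-occurrence counts. This is not one of the two Symp-Med matching algorithms from the thesis, which the abstract does not describe; it is a simplified stand-in under stated assumptions. The function names `build_symp_med_graph` and `recommend`, the scoring rule (sum of edge weights), and the toy notes are all hypothetical.

```python
from collections import defaultdict


def build_symp_med_graph(notes):
    """Build a weighted bipartite graph: symptom -> medication -> count.

    Each note is a (symptoms, medications) pair of already-extracted
    concepts; an edge weight counts the notes in which both appear.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for symptoms, medications in notes:
        for symptom in set(symptoms):
            for medication in set(medications):
                graph[symptom][medication] += 1
    return graph


def recommend(graph, symptoms, top_k=3):
    """Rank medications by summed edge weight over the given symptoms."""
    scores = defaultdict(int)
    for symptom in symptoms:
        for medication, weight in graph.get(symptom, {}).items():
            scores[medication] += weight
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [medication for medication, _ in ranked[:top_k]]


# Toy data, invented for illustration only.
notes = [
    (["fever", "cough"], ["paracetamol"]),
    (["fever"], ["paracetamol", "ibuprofen"]),
    (["cough"], ["dextromethorphan"]),
]
graph = build_symp_med_graph(notes)
top = recommend(graph, ["fever", "cough"])
```

With the toy notes, `paracetamol` ranks first because it co-occurs with both query symptoms. A real system would need normalized concepts and a scoring rule that accounts for how common each medication is overall.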