23 research outputs found

    Information extraction from medication leaflets

    Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201

    Clinical records anonymisation and text extraction (CRATE): an open-source software system.

    BACKGROUND: Electronic medical records contain information of value for research, but contain identifiable and often highly sensitive confidential information. Patient-identifiable information cannot in general be shared outside clinical care teams without explicit consent, but anonymisation/de-identification allows research uses of clinical data without explicit consent. RESULTS: This article presents CRATE (Clinical Records Anonymisation and Text Extraction), an open-source software system with separable functions: (1) it anonymises or de-identifies arbitrary relational databases, with sensitivity and precision similar to previous comparable systems; (2) it uses public secure cryptographic methods to map patient identifiers to research identifiers (pseudonyms); (3) it connects relational databases to external tools for natural language processing; (4) it provides a web front end for research and administrative functions; and (5) it supports a specific model through which patients may consent to be contacted about research. CONCLUSIONS: Creation and management of a research database from sensitive clinical records with secure pseudonym generation, full-text indexing, and a consent-to-contact process is possible and practical using entirely free and open-source software. The project was funded in part by the UK National Institute of Health Research Cambridge Biomedical Research Centre. The work was conducted within the Behavioural and Clinical Neuroscience Institute, University of Cambridge, supported by the Wellcome Trust and the UK Medical Research Council.
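
The abstract above mentions mapping patient identifiers to research identifiers (pseudonyms) with public cryptographic methods, but does not say which ones. As a rough illustration only, the sketch below uses a keyed hash (HMAC-SHA256 from the Python standard library), which is one common way to do this; it is an assumption, not a description of CRATE's actual implementation. The function name `pseudonymise`, the example key, and the truncation length are all hypothetical.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Map a patient identifier to a research pseudonym.

    Hypothetical helper: a keyed hash (HMAC-SHA256) is assumed here;
    the abstract does not name the method CRATE uses.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    # Truncate the hex digest to keep the pseudonym short; the full
    # 64-character digest can be kept if collisions are a concern.
    return digest.hexdigest()[:length]


# Illustrative key only; a real key would be generated randomly and kept secret.
EXAMPLE_KEY = b"example-key-not-for-production"

pid_a = pseudonymise("patient-0001", EXAMPLE_KEY)
pid_b = pseudonymise("patient-0002", EXAMPLE_KEY)
```

Because the hash is keyed, the same patient always maps to the same pseudonym, while someone without the key cannot recompute the mapping from a list of candidate identifiers.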

    Processamento automático de texto de narrativas clínicas

    The informatization of medical systems, and the subsequent move by medical professionals from paper records to Electronic Health Records (EHR), has allowed for safer and more efficient healthcare. EHR can also be used as a data source for observational studies around the world. However, an estimated 70-80% of all clinical data is unstructured free text, and the data that is structured does not all follow the same standards, which makes it difficult to use in such observational studies. This dissertation aims to tackle those two problems by using natural language processing to extract concepts from free text and then using a common data model to harmonize the data. The developed system employs an annotator, namely cTAKES, to extract the concepts from free text. The extracted concepts are then normalized using text preprocessing, word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the normalized concepts are converted to the OMOP Common Data Model and stored in a database. The i2b2 2010 data set was used to test the developed system. The different components of the system were tested and evaluated separately, with the concept extraction component achieving a precision, recall and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization component was evaluated by completing the N2C2 2019 challenge track 3, where it achieved 77.5% accuracy. Finally, in the OMOP CDM conversion component, 7.92% of the concepts were observed to be lost during the process.
    In conclusion, even though the developed system still has room for improvement, it proves to be a viable method of automatically processing clinical narratives. Master's degree in Computer and Telematics Engineering.
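
As a quick consistency check on the concept extraction figures reported above, the F-score can be recomputed from the precision and recall. The sketch below assumes the reported F-score is the balanced F1 (harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported in the abstract, in percent.
reported_f1 = round(f_score(77.12, 70.29), 2)
print(reported_f1)  # 73.55
```

The result agrees with the 73.55% given in the abstract, so the three reported figures are mutually consistent under the F1 assumption.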

    J Biomed Inform

    We followed a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses to identify existing clinical natural language processing (NLP) systems that generate structured information from unstructured free text. Seven literature databases were searched with a query combining the concepts of natural language processing and structured data capture. Two reviewers screened all records for relevance during two screening phases, and information about clinical NLP systems was collected from the final set of papers. A total of 7149 records (after removing duplicates) were retrieved and screened, and 86 were determined to fit the review criteria. These papers contained information about 71 different clinical NLP systems, which were then analyzed. The NLP systems address a wide variety of important clinical and research tasks. Certain tasks are well addressed by the existing systems, while others remain as open challenges that only a small number of systems attempt, such as extraction of temporal information or normalization of concepts to standard terminologies. This review has identified many NLP systems capable of processing clinical free text and generating structured output, and the information collected and evaluated here will be important for prioritizing development of new approaches for clinical NLP. Funding: CC999999/ImCDC/Intramural CDC HHS/United States. PMID: 28729030. PMCID: PMC6864736.

    Extracting clinical knowledge from electronic medical records

    As the adoption of Electronic Medical Records (EMRs) rises in healthcare institutions, the importance of these resources increases because of all the clinical information they contain about patients. However, the unstructured information present in these records in the form of clinical narratives makes it hard to extract and structure useful clinical knowledge. This limits the potential of EMRs, because the clinical information they contain can be used to perform essential tasks inside healthcare institutions such as searching, summarization, decision support and statistical analysis, as well as to support management decisions or serve research. These tasks can only be done if the unstructured clinical information from the narratives is appropriately extracted, structured and processed into clinical knowledge. Usually, this extraction and structuring is performed manually by healthcare practitioners, which is inefficient and error-prone. This research aims to propose a solution to this problem by using Machine Translation (MT) from Portuguese to English, Natural Language Processing (NLP) and Information Extraction (IE) techniques. With the help of these techniques, the goal is to develop a modular prototype pipeline system that can automatically extract clinical knowledge from unstructured clinical information contained in Portuguese EMRs, in order to help EMRs fulfil their potential and consequently help the Portuguese hospital involved in this research.
    This research also intends to show that this generic prototype system and approach can potentially be applied to other hospitals, even if they do not use the Portuguese language.

    Semantic annotation of medical documents in CDA context

    The goal of this work is to recover semantic and structural information from medical documents in electronic format. Despite the progressive diffusion of Electronic Health Record systems, a lot of medical information is, partly for legacy reasons, available to patients and physicians only in image-only or textual format. The difficulty of obtaining such information when needed results in high costs for health providers. In this work we develop the concept of a system designed to convert legacy medical documents into a standard and interoperable format compliant with the Clinical Document Architecture model by means of semantic annotation.

    Improving Syntactic Parsing of Clinical Text Using Domain Knowledge

    Syntactic parsing is one of the fundamental tasks of Natural Language Processing (NLP). However, few studies have explored syntactic parsing in the medical domain. This dissertation systematically investigated different methods to improve the performance of syntactic parsing of clinical text, including (1) constructing two clinical treebanks of discharge summaries and progress notes by developing annotation guidelines that handle missing elements in clinical sentences; (2) retraining four state-of-the-art parsers (the Stanford, Berkeley, Charniak, and Bikel parsers) using clinical treebanks, and comparing their performance to identify better parsing approaches; and (3) developing new methods that use semantic information to reduce syntactic ambiguity caused by Prepositional Phrase (PP) attachment and coordination. Our evaluation showed that clinical treebanks greatly improved the performance of existing parsers. The Berkeley parser achieved the best F-1 score of 86.39% on the MiPACQ treebank. For PP attachment, our proposed methods improved accuracy by 2.35% on the MiPACQ corpus and 1.77% on the i2b2 corpus. For coordination, our method achieved a precision of 94.9% and 90.3% on the MiPACQ and i2b2 corpora, respectively. To further demonstrate the effectiveness of the improved parsing approaches, we applied outputs of our parsers to two external NLP tasks: semantic role labeling and temporal relation extraction. The experimental results showed that performance on both tasks was improved by using the parse tree information from our optimized parsers, with an improvement of 3.26% in F-measure for semantic role labeling and 1.5% in F-measure for temporal relation extraction.

    Extracting knowledge from documents related with invasive fungal infections in iron overload context

    Master's dissertation in Bioinformatics. Invasive fungal infections caused by Candida are associated with high mortality and morbidity rates in hospitalized patients. Iron plays a major role in these infections, as they are exacerbated under iron overload conditions. In this context, it is important to understand the association between iron levels and invasive fungal infections, as it can serve as an indicator of the severity of the disease and may eventually help establish measures to improve treatment efficacy. Manually inferring these associations from biomedical documents is a time-consuming task because of the large amount of scientific text available. Such tasks therefore benefit from the field of Biomedical Text Mining, which includes a wide variety of methods for automatic extraction of high-quality information from biomedical text documents. In this work, relevant documents related to iron overload and fungal infections were retrieved from PubMed to build a corpus. Then, both Named Entity Recognition and Relation Extraction processes were executed using the @Note text mining tool. Finally, relevant sentences were manually extracted and a curated dataset with documents containing those sentences was created. Since the number of publications obtained about Candida and iron overload was very low, the analysis took all fungi into account. A total of 15 publications were considered relevant and 168 relevant associations were extracted. Although associations of iron levels with both severity of infection and treatment efficacy were not extracted, it was possible to conclude that, in many cases, iron overload is a predictor of fungal infections, and patients' iron levels strongly affect treatment efficacy.
    The Biomedical Text Mining process described in the present thesis enabled the creation of a dataset of relevant biomedical publications containing interesting associations between fungal infections, drugs and associated diseases in a clinical context of iron overload. In the future this process could be improved, especially regarding dictionaries, in order to obtain a higher number of relevant publications.

    Methods and Techniques for Clinical Text Modeling and Analytics

    Nowadays, a large portion of clinical data exists only as free text. The wide adoption of Electronic Health Records (EHRs) has increased access to clinical documents, which provides challenges and opportunities for clinical Natural Language Processing (NLP) researchers. Given free-text clinical notes as input, an ideal system for clinical text understanding should be able to support clinical decisions. At corpus level, the system should recommend similar notes based on disease or patient types, and provide medication recommendations, or any other type of recommendation, based on patients' symptoms and other similar medical cases. At document level, it should return a list of important clinical concepts. Moreover, the system should be able to make diagnostic inferences over clinical concepts and output a diagnosis. Unfortunately, current work has not systematically studied such a system. This study focuses on developing and applying methods and techniques for different aspects of a system for clinical text understanding, at both corpus and document level. We deal with two major research questions. First, we explore the question of how to model the underlying relationships in clinical notes at corpus level. Document clustering methods can group clinical notes into meaningful clusters, which can help physicians and patients understand medical conditions and diseases from clinical notes. We use Nonnegative Matrix Factorization (NMF) and Multi-view NMF to cluster clinical notes based on extracted medical concepts. The clustering results display latent patterns that exist among clinical notes. Our method provides a feasible way to visualize a corpus of clinical documents. Based on extracted concepts, we further build a symptom-medication (Symp-Med) graph to model the Symp-Med relations in a corpus of clinical notes. We develop two Symp-Med matching algorithms to predict and recommend medications for patients based on their symptoms.
    Second, we address the question of how to integrate structured knowledge with unstructured text to improve results for clinical NLP tasks. On the one hand, unstructured clinical text contains a great deal of information about medical conditions. On the other hand, structured Knowledge Bases (KBs) are frequently used to support clinical NLP tasks. We propose graph-regularized word embedding models to integrate knowledge from both KBs and free text. We evaluate our models on standard datasets and biomedical NLP tasks, and the results showed encouraging improvements on both. We further apply the graph-regularized word embedding models and present a novel approach to automatically infer the most probable diagnosis from a given clinical narrative. Ph.D., Information Studies -- Drexel University, 201