
    Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study

    BACKGROUND: The bidirectional encoder representations from transformers (BERT) model has achieved great success in many natural language processing (NLP) tasks, such as named entity recognition and question answering. However, little prior work has explored this model for an important task in the biomedical and clinical domains, namely entity normalization. OBJECTIVE: We aim to investigate the effectiveness of BERT-based models for biomedical and clinical entity normalization. Our second objective is to investigate whether, and to what degree, the domain of the training data influences the performance of BERT-based models. METHODS: Our data comprised 1.5 million unlabeled electronic health record (EHR) notes. We first fine-tuned BioBERT on this large collection of unlabeled EHR notes, producing our BERT-based model trained on 1.5 million EHR notes (EhrBERT). We then further fine-tuned EhrBERT, BioBERT, and BERT on three annotated corpora for biomedical and clinical entity normalization: the Medication, Indication, and Adverse Drug Events (MADE) 1.0 corpus, the National Center for Biotechnology Information (NCBI) disease corpus, and the Chemical-Disease Relations (CDR) corpus. We compared our models with two state-of-the-art normalization systems, namely MetaMap and disease name normalization (DNorm). RESULTS: EhrBERT achieved 40.95% F1 in the MADE 1.0 corpus for mapping named entities to the Medical Dictionary for Regulatory Activities and the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT), which together contain about 380,000 terms. In this corpus, EhrBERT outperformed MetaMap by 2.36% in F1. On the NCBI disease corpus and the CDR corpus, EhrBERT also outperformed DNorm, improving F1 from 88.37% and 89.92% to 90.35% and 93.82%, respectively. EhrBERT also outperformed BioBERT and BERT on the MADE 1.0 corpus and the CDR corpus. CONCLUSIONS: Our work shows that BERT-based models achieve state-of-the-art performance for biomedical and clinical entity normalization and can be readily fine-tuned to normalize any kind of named entity.
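The core task this abstract describes, mapping a free-text mention to a concept identifier in a large terminology, can be sketched independently of the BERT machinery. The following is a minimal lexical baseline using token-overlap (Jaccard) scoring; the terminology entries, concept IDs, and the example mention are invented for illustration, and a BERT-based system would replace this surface score with learned representations.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def normalize(mention, terminology):
    """Map a mention to the concept ID whose preferred term overlaps most."""
    best_id, best_score = None, -1.0
    for concept_id, term in terminology.items():
        score = jaccard(mention, term)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id

# Toy terminology (invented IDs/terms standing in for MedDRA or SNOMED-CT entries).
terminology = {
    "C0004238": "atrial fibrillation",
    "C0018681": "headache",
    "C0027497": "nausea and vomiting",
}

print(normalize("fibrillation of the atria", terminology))  # -> C0004238
```

The weakness of such lexical baselines on paraphrased mentions is exactly what motivates learned encoders like EhrBERT.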

    Disease Name Extraction from Clinical Text Using Conditional Random Fields

    The aim of the research in this thesis was to extract disease and disorder names from clinical texts. We used Conditional Random Fields (CRF) as the main method to label diseases and disorders in clinical sentences, and additional tools such as MetaMap and the Stanford CoreNLP toolkit to extract crucial features. MetaMap was used to identify names of diseases/disorders already present in the UMLS Metathesaurus. Other important features, such as lemmatized word forms and POS tags, were extracted with Stanford CoreNLP. Further features, including the semantic types of words, were extracted directly from the UMLS Metathesaurus. We participated in Task 7 of the SemEval 2014 competition and used its provided data to train and evaluate our system. The training data contained 199 clinical texts, the development data 99, and the test data 133; these included discharge summaries and echocardiogram, radiology, and ECG reports. We obtained competitive results on the disease/disorder name extraction task. An ablation study showed that, while all features contributed, MetaMap matches, POS tags, and the previous and next words were the most effective features.
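The feature set the thesis describes for the CRF tagger can be sketched as a per-token feature function. This is a hypothetical illustration: the tiny lexicon below stands in for MetaMap's UMLS lookups, and lemma/POS features from Stanford CoreNLP are omitted for brevity.

```python
# Stand-in for MetaMap matches against the UMLS Metathesaurus (invented entries).
DISEASE_LEXICON = {"pneumonia", "diabetes", "sepsis"}

def token_features(tokens, i):
    """Build the feature dict for tokens[i]: surface form, shape,
    neighboring words, and a MetaMap-style lexicon-match flag."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "in_lexicon": word.lower() in DISEASE_LEXICON,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = ["Patient", "denies", "pneumonia", "."]
print(token_features(sentence, 2))
```

In a full system, one such dict per token would be fed to a CRF library together with BIO labels, so that the model can weigh the lexicon match against contextual evidence like the preceding "denies".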

    Incorporating Ontological Information in Biomedical Entity Linking of Phrases in Clinical Text

    Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. Translational application of BEL to clinical notes has enormous potential for augmenting discretely captured data in electronic health records, but the existing paradigm for evaluating BEL systems developed in academia is not well aligned with real-world use cases. In this work, we demonstrate a proof of concept for incorporating ontological similarity into the training and evaluation of BEL systems to begin to rectify this misalignment. This thesis has two primary components: 1) a comprehensive literature review and 2) a methodology section proposing novel BEL techniques to contribute to scientific progress in the field. In the literature review, I survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, and outline the technical components that comprise BEL systems. In the methodology section, I describe my experiments incorporating ontological information into training a BERT encoder for entity linking.
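One common way to quantify the "ontological similarity" this abstract argues for is a graph-based measure over the concept hierarchy, such as Wu-Palmer similarity: concepts sharing a deep common ancestor score higher than concepts related only near the root. A minimal sketch over an invented toy ontology (the node names and structure are not from the thesis):

```python
# Toy is-a hierarchy, child -> parent (invented for illustration).
PARENT = {
    "bacterial_pneumonia": "pneumonia",
    "viral_pneumonia": "pneumonia",
    "pneumonia": "lung_disease",
    "asthma": "lung_disease",
    "lung_disease": "disease",
    "migraine": "neurological_disease",
    "neurological_disease": "disease",
}

def ancestors(node):
    """Path from node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(a) + depth(b)),
    where depth counts nodes from the root (root has depth 1)."""
    pa, pb = ancestors(a), ancestors(b)
    lcs = next(n for n in pa if n in set(pb))  # lowest common subsumer
    depth = lambda n: len(ancestors(n))
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("bacterial_pneumonia", "viral_pneumonia"))  # -> 0.75
```

Scoring a near-miss prediction by such a measure, rather than as a flat error, is one way to fold ontological information into both training losses and evaluation, in the spirit of the proof of concept described above.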

    Normalization of Disease Mentions with Convolutional Neural Networks

    Normalization of disease mentions plays an important role in biomedical natural language processing (BioNLP) applications, such as the construction of biomedical databases. Various disease mention normalization systems have been developed, though state-of-the-art systems either rely on candidate concept generation or do not generalize to concepts not seen during training. This thesis explores the possibility of building a disease mention normalization system that both generalizes to unseen concepts and does not rely on candidate generation. To this end, it is hypothesized that modern neural networks are sophisticated enough to solve this problem. The hypothesis is tested by building a normalization system using deep learning approaches and evaluating its accuracy on the NCBI disease corpus. The system leverages semantic information in the biomedical literature by using continuous vector space representations for strings of disease mentions and concepts. A neural encoder is trained to encode vector representations of mention and concept strings; in principle, this encoder enables the model to generalize to concepts unseen during training. The encoded strings are used to compare the similarity between concepts and a given mention. Viewing normalization as a ranking problem, the concept with the highest estimated similarity is selected as the predicted concept for the mention. During development, synthetic data is used for pre-training to facilitate learning, and various architectures are explored. While the model succeeds in predicting without candidate concept generation, its performance is not comparable to that of state-of-the-art systems. Normalization of disease mentions without candidate generation, while preserving the ability to generalize to unseen concepts, is not trivial. Further efforts could focus on, for example, testing more neural architectures and using more sophisticated word representations.
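The normalization-as-ranking formulation described above, encode the mention and every concept, then pick the concept with the highest similarity, reduces to a nearest-neighbor search in vector space. In the following sketch the trained neural encoder is replaced by fixed toy vectors; the concept IDs and embeddings are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_concepts(mention_vec, concept_vecs):
    """Return concept IDs sorted by similarity to the mention, best first."""
    return sorted(concept_vecs,
                  key=lambda c: cosine(mention_vec, concept_vecs[c]),
                  reverse=True)

# Toy concept embeddings (in the thesis, a trained encoder produces these).
concept_vecs = {
    "D003920": [0.9, 0.1, 0.0],   # e.g. diabetes mellitus
    "D006261": [0.0, 0.8, 0.2],   # e.g. headache
}
mention = [0.85, 0.15, 0.05]      # e.g. the encoded mention "sugar diabetes"

print(rank_concepts(mention, concept_vecs)[0])  # -> D003920
```

Because any concept with an encodable name can be ranked, this scheme needs no candidate generation and can, in principle, return concepts never seen during training, which is exactly the property the thesis tests.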

    Use of Structured Data in Answering Health-Related Questions

    The current standard way of searching for information is through some kind of search engine. Even though there has been progress, search is still mainly based on retrieving a list of documents in which the words you searched for appear. Since the user's goal is to find an answer to a question, having to look through multiple documents in the hope that one of them has the needed information is not very efficient. The aim of this thesis is to improve the process of searching for information, in this case medical knowledge, in two ways. The first is to replace the keywords usually given to a search engine with something more natural to humans: a question in its natural form. The second is to make use of the additional information present in a question to provide the user with an answer to that question, instead of a list of documents in which the keywords appear. Since social media are where people pose questions that are usually answered by other humans, rather than issuing search-engine queries, they are the natural place to collect the questions we aim to answer automatically. The first step toward answering these questions is to classify them in order to determine what kind of information their answers should contain. The second step is to identify and categorize the biomedical keywords that would have been used had the search been performed in the conventional way. With the keywords identified and the target information type known, the question can be mapped into a query format and the information needed to provide an answer retrieved. Mestrado em Engenharia Informática (Master's in Informatics Engineering).
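The two-step pipeline described above, classify the question type, then extract the keywords and map both into a structured query, can be sketched with simple rules. The question-type patterns, stopword list, and query format below are all invented for illustration; the thesis presumably uses trained classifiers and biomedical entity recognizers instead.

```python
# Hypothetical prefix patterns mapping question phrasing to an answer type.
QUESTION_TYPES = {
    "what causes": "etiology",
    "how to treat": "treatment",
    "what are the symptoms of": "symptoms",
}

# Minimal stopword list covering the pattern words themselves.
STOPWORDS = {"what", "causes", "how", "to", "treat", "are",
             "the", "symptoms", "of"}

def classify(question):
    """Assign an answer type from the first matching prefix pattern."""
    q = question.lower()
    for prefix, qtype in QUESTION_TYPES.items():
        if q.startswith(prefix):
            return qtype
    return "unknown"

def keywords(question):
    """Keep the content words a conventional search would have used."""
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOPWORDS]

def build_query(question):
    """Map a natural-language question to a structured query."""
    return {"type": classify(question), "keywords": keywords(question)}

print(build_query("What causes chronic migraine?"))
```

The resulting structured query (`{"type": "etiology", "keywords": ["chronic", "migraine"]}`) is what a knowledge base can actually execute, which is the step that lets the system return an answer instead of a document list.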

    Deep Neural Models for Medical Concept Normalization in User-Generated Texts

    In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a health-related entity mention in free-form text to a concept in a controlled vocabulary, usually the standard thesaurus in the Unified Medical Language System (UMLS). This is a challenging task, since medical terminology differs greatly between health care professionals and the general public writing social media texts. We approach it as a sequence learning problem with powerful neural networks, such as recurrent neural networks and contextualized word representation models, trained to obtain semantic representations of social media expressions. Our experimental evaluation over three different benchmarks shows that these neural architectures leverage the semantic meaning of the entity mention and significantly outperform existing state-of-the-art models. Comment: This is a preprint of the paper "Deep Neural Models for Medical Concept Normalization in User-Generated Texts," to be published at ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop.
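Benchmark evaluations like the one this abstract reports are typically scored with accuracy@k: a mention counts as correctly normalized if the gold concept appears among the top k ranked candidates. A minimal sketch, with invented concept IDs and rankings:

```python
def accuracy_at_k(gold, ranked_predictions, k):
    """Fraction of mentions whose gold concept is in the top-k candidates."""
    hits = sum(1 for g, preds in zip(gold, ranked_predictions) if g in preds[:k])
    return hits / len(gold)

# Toy evaluation data: one gold concept and one ranked candidate list per mention.
gold = ["C001", "C002", "C003"]
ranked = [
    ["C001", "C009"],   # correct at rank 1
    ["C005", "C002"],   # correct at rank 2
    ["C007", "C008"],   # missed entirely
]

print(accuracy_at_k(gold, ranked, 1))  # -> 0.333...
print(accuracy_at_k(gold, ranked, 2))  # -> 0.666...
```

Reporting both accuracy@1 and a looser accuracy@k separates outright failures from near-misses where the right concept was ranked just below the top.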