236 research outputs found

    Explainable clinical coding with in-domain adapted transformers

    Background and Objective: Automatic clinical coding is a crucial task in the process of extracting relevant information from unstructured medical documents contained in Electronic Health Records (EHR). However, most existing computer-based methods for clinical coding act as “black boxes”: they give no detailed account of the reasons behind the clinical-coding assignments, which greatly limits their applicability to real-world medical scenarios. The objective of this study is to use transformer-based models to effectively tackle explainable clinical coding, requiring the models not only to assign clinical codes to medical cases but also to provide the reference in the text that justifies each coding assignment. Methods: We examine the performance of 3 transformer-based architectures on 3 different explainable clinical-coding tasks. For each transformer, we compare the performance of the original general-domain version with an in-domain version of the model adapted to the specificities of the medical domain. We address the explainable clinical-coding problem as a dual medical named entity recognition (MER) and medical named entity normalization (MEN) task. For this purpose, we have developed two different approaches, namely a multi-task and a hierarchical-task strategy. Results: For each analyzed transformer, the clinical-domain version significantly outperforms the corresponding general-domain model across the 3 explainable clinical-coding tasks analyzed in this study. Furthermore, the hierarchical-task approach yields significantly superior performance to the multi-task strategy. Specifically, combining the hierarchical-task strategy with an ensemble approach that leverages the predictive capabilities of the 3 distinct clinical-domain transformers achieves the best overall results. Funding for open access charge: Universidad de Málaga / CBUA.
The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga
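    The dual MER–MEN formulation described above lends itself to a hierarchical pipeline: a recognition step proposes text spans, and a separate normalization step assigns a clinical code to each span, so every code assignment is justified by the span that produced it. The sketch below illustrates that hierarchical-task idea only; the toy lexicon, codes, and string matcher are invented for illustration and are not the authors' actual transformer models.

```python
def recognize_mentions(text, lexicon):
    """MER step: find lexicon terms occurring in the text (toy matcher)."""
    mentions = []
    lowered = text.lower()
    for term in lexicon:
        start = lowered.find(term)
        if start != -1:
            mentions.append((start, start + len(term), text[start:start + len(term)]))
    return mentions

def normalize_mentions(mentions, term_to_code):
    """MEN step: map each recognized span to a clinical code, keeping the
    surface form as the textual evidence that justifies the assignment."""
    return [
        {"span": (s, e), "evidence": surface, "code": term_to_code[surface.lower()]}
        for s, e, surface in mentions
    ]

term_to_code = {"type 2 diabetes": "E11", "hypertension": "I10"}  # toy lexicon
note = "Patient with Type 2 diabetes and hypertension."

mentions = recognize_mentions(note, term_to_code)
assignments = normalize_mentions(mentions, term_to_code)
for a in assignments:
    print(a["code"], "justified by", repr(a["evidence"]))
```

    Because normalization only sees spans that recognition produced, every emitted code carries its justifying reference by construction, which is the property the abstract asks of an explainable coder.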

    Incorporating Ontological Information in Biomedical Entity Linking of Phrases in Clinical Text

    Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. Translational application of BEL to clinical notes has enormous potential for augmenting discretely captured data in electronic health records, but the existing paradigm for evaluating BEL systems developed in academia is not well aligned with real-world use cases. In this work, we demonstrate a proof of concept for incorporating ontological similarity into the training and evaluation of BEL systems to begin to rectify this misalignment. This thesis has two primary components: 1) a comprehensive literature review and 2) a methodology section proposing novel BEL techniques to contribute to scientific progress in the field. In the literature review component, I survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, and outline the technical components that comprise BEL systems. In the methodology component, I describe my experiments incorporating ontological information into training a BERT encoder for entity linking.
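    One way to make the idea of "ontological similarity in evaluation" concrete is to give partial credit when a predicted concept is close to the gold concept in the ontology, instead of the usual all-or-nothing exact match. The sketch below uses a toy ontology and a Wu-Palmer-style similarity; both are illustrative assumptions, not the thesis's actual ontology or scoring function.

```python
TOY_ONTOLOGY = {            # child -> parent (None marks the root)
    "disease": None,
    "diabetes": "disease",
    "type1_diabetes": "diabetes",
    "type2_diabetes": "diabetes",
    "infection": "disease",
}

def ancestors(concept):
    """Path from a concept up to the root, inclusive."""
    path = [concept]
    while TOY_ONTOLOGY[concept] is not None:
        concept = TOY_ONTOLOGY[concept]
        path.append(concept)
    return path

def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), with the root at depth 1."""
    path_a, path_b = ancestors(a), ancestors(b)
    lcs = next(c for c in path_a if c in path_b)   # lowest common subsumer
    depth = lambda c: len(ancestors(c))
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("type2_diabetes", "type2_diabetes"))   # exact match: 1.0
print(wu_palmer("type1_diabetes", "type2_diabetes"))   # sibling: partial credit
```

    Averaging such scores over a test set yields a similarity-weighted accuracy that rewards near misses, which is better aligned with clinical use cases where a sibling concept is far less harmful than an unrelated one.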

    Snomed CT in a Language Isolate: an Algorithm for a Semiautomatic Translation

    Background: The Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) is officially released in English and Spanish. In the Basque Autonomous Community two languages, Spanish and Basque, are official. This paper presents the first attempt to semi-automatically translate the SNOMED CT terminology content into Basque, a less-resourced language. Methods: A translation algorithm grounded in Natural Language Processing methods has been designed and partially implemented. The algorithm comprises four phases, of which the first two have been implemented and quantitatively evaluated. Results: Results are promising, as we obtained Basque equivalents for 21.41% of the disorder terms of the English SNOMED CT release. As the methods developed focus on that hierarchy, the results in other hierarchies are lower (12.57% for body structure descriptions, 8.80% for findings, and 3% for procedures). Conclusions: We are on the way to achieving two of our objectives in translating SNOMED CT into Basque: to use our language to access rich multilingual resources and to strengthen the use of the Basque language in the biomedical area. This work was partially supported by the European Commission (325099), the Spanish Ministry of Science and Innovation (TIN2012-38584-C06-02) and the Basque Government (IT344-10 and IE12-333). Olatz Perez-de-Viñaspre's work is funded by a PhD grant from the Basque Government (BFI-2011-389).

    Mapping of electronic health records in Spanish to the unified medical language system metathesaurus

    This work presents a preliminary approach to annotate Spanish electronic health records with concepts of the Unified Medical Language System Metathesaurus. The prototype uses Apache Lucene to index the Metathesaurus and generate mapping candidates from input text. In addition, it relies on UKB to resolve ambiguities. The tool has been evaluated by measuring its agreement with MetaMap on two English-Spanish parallel corpora, one consisting of titles and abstracts of papers in the clinical domain, and the other of real electronic health record excerpts.
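    The agreement measurement described above can be made concrete by treating each document's annotations as a set of Metathesaurus concept identifiers (CUIs) and micro-averaging the overlap between the two systems across the corpus. This is a hedged sketch of one plausible way to do it; the sample CUIs are invented, and the actual evaluation protocol of the paper may differ.

```python
def micro_agreement(system_a, system_b):
    """system_a / system_b: one set of CUIs per document, in parallel order.
    Returns micro-averaged precision, recall, and F1 of system_a taking
    system_b as the reference."""
    tp = sum(len(a & b) for a, b in zip(system_a, system_b))
    a_total = sum(len(a) for a in system_a)
    b_total = sum(len(b) for b in system_b)
    precision = tp / a_total
    recall = tp / b_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented two-document example: prototype output vs. MetaMap output.
prototype = [{"C0011849", "C0020538"}, {"C0027051"}]
metamap   = [{"C0011849"},             {"C0027051", "C0032285"}]
p, r, f = micro_agreement(prototype, metamap)
```

    Note that "agreement" here is symmetric in spirit but asymmetric in form: swapping the two systems exchanges precision and recall while leaving F1 unchanged.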

    Entity Linking in Low-Annotation Data Settings

    Recent advances in natural language processing have focused on applying and adapting large pretrained language models to specific tasks. These models, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020a), are pretrained on massive amounts of unlabeled text across a variety of domains. The impact of these pretrained models is visible in the task of entity linking, where a mention of an entity in unstructured text is matched to the relevant entry in a knowledge base. State-of-the-art linkers, such as Wu et al. (2020) and De Cao et al. (2021), leverage pretrained models as a foundation for their systems. However, these models are also trained on large amounts of annotated data, which is crucial to their performance. Often these large datasets consist of domains that are easily annotated, such as Wikipedia or newswire text. Tailoring NLP tools to such a narrow variety of textual domains severely restricts their use in the real world. Many other domains, such as medicine or law, do not have large amounts of entity-linking annotations available. Entity linking, which serves to bridge the gap between massive amounts of unstructured text and structured repositories of knowledge, is equally crucial in these domains. Yet tools trained on newswire or Wikipedia annotations are unlikely to be well suited for identifying medical conditions mentioned in clinical notes. As most annotation efforts focus on English, similar challenges arise in building systems for non-English text. There is often a relatively small amount of annotated data in these domains, so looking to other types of domain-specific data, such as unannotated text or highly curated structured knowledge bases, is often required. In these settings, it is crucial to translate lessons learned from tools tailored for high-annotation domains into algorithms suited for low-annotation domains. This requires both leveraging broader types of data and understanding the unique challenges present in each domain.

    Information extraction from Spanish radiology reports

    In recent years, the amount of digitized clinical data has been growing steadily, owing to the adoption of clinical information systems. A great amount of this data is in textual format. The extraction of information contained in texts can be used to support clinical tasks and decisions and is essential for improving health care. The biomedical domain uses a highly specialized and local vocabulary, with an abundance of non-standard and ambiguous abbreviations. Moreover, some types of medical reports present ill-formed sentences and lack diacritics. Publicly accessible annotated data is scarce, for two main reasons: the difficulty of creating it and the confidential nature of the data, which demands de-identification.
    This situation hinders the advance of information extraction in the biomedical domain. Although Spanish is the second language in the world by number of native speakers, not much work has been done on information extraction from Spanish medical reports. Challenges include the absence of specific terminologies for certain medical domains in Spanish and linguistic resources that are less developed than those of high-resource languages such as English. In this thesis, we contribute to the BioNLP domain by providing methods with competitive results for applying a fragment of a medical information-extraction pipeline to Spanish radiology reports. To this end, an annotated dataset for entity recognition, negation and speculation detection, and relation extraction was created. The annotation process followed and the annotation schema developed were shared with the community. Two named entity recognition algorithms were implemented for the detection of anatomical entities and clinical findings. The first algorithm is based on a specialized radiology-domain dictionary not available in Spanish and on rules based on morphosyntactic knowledge, and is designed for named entity recognition in medium- or low-resource languages. The second, based on conditional random fields, was implemented when we were able to obtain a larger set of annotated data and achieves better results. We also studied and implemented different solutions for negation detection of clinical findings: an adaptation to Spanish of a popular negation-detection algorithm for English medical reports, and a rule-based method that detects negations based on patterns inferred from the analysis of paths in dependency parse trees. The first method obtained the best results and was also adapted for negation and speculation detection in German clinical notes and discharge summaries.
    We consider that the results obtained and the annotation guidelines provided will help further advance the field of information extraction from Spanish medical reports. Fil: Cotik, Viviana Erica. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales; Argentina
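    The first negation-detection approach mentioned above adapts a well-known trigger-based algorithm for English medical text to Spanish. In that family of algorithms, a negation cue preceding a clinical finding within a fixed token window marks the finding as negated. The sketch below illustrates only that rule shape; the trigger list, window size, and example are invented for illustration and are not the thesis's actual rule set.

```python
NEG_TRIGGERS = {"sin", "no", "niega"}   # illustrative Spanish negation cues
WINDOW = 5                              # max tokens between cue and finding

def is_negated(tokens, finding_index):
    """True if a negation trigger occurs within WINDOW tokens before the
    finding at finding_index."""
    start = max(0, finding_index - WINDOW)
    return any(tok.lower() in NEG_TRIGGERS for tok in tokens[start:finding_index])

report = "Paciente sin evidencia de neumotórax".split()
# The finding "neumotórax" is at index 4; the cue "sin" precedes it.
print(is_negated(report, 4))    # True
```

    The thesis's second, dependency-based method would instead inspect the path between the cue and the finding in a parse tree, which handles longer-distance negation that a flat token window misses.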

    Front-Line Physicians' Satisfaction with Information Systems in Hospitals

    Day-to-day operations management in hospital units is difficult due to continuously varying situations, the several actors involved, and the vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with the existing information systems needed to support day-to-day operations management in hospitals. A cross-sectional survey was used, and data selected by stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but that they do not improve access to information, nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer a single information system for accessing important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision-making process. Peer reviewed.

    DETAILED CLINICAL MODELS AND THEIR RELATION WITH ELECTRONIC HEALTH RECORDS

    Compendium thesis. [EN] The healthcare domain produces and consumes large quantities of people's health data. Although data exchange is the norm rather than the exception, access to all patient data is still far from being achieved. Current developments such as personal health records will introduce even more data and complexity into the Electronic Health Record (EHR). Achieving semantic interoperability is one of the biggest challenges to overcome in order to benefit from all the information contained in the distributed EHR. This requires that the semantics of the information can be understood by all involved parties. It has been established that three layers are needed to achieve semantic interoperability: reference models, clinical models (archetypes), and clinical terminologies. As seen in the literature, information models (reference models and clinical models) lack methodologies and tools for improving EHR systems and for developing new systems that can be semantically interoperable. The purpose of this thesis is to provide methodologies and tools for advancing the use of archetypes in three different scenarios: - Archetype definition over specifications with no native dual-model support. Any EHR architecture that directly or indirectly has the notion of detailed clinical models (such as HL7 CDA templates) can potentially be used as a reference model for archetype definition. This allows transforming single-model architectures (which contain only a reference model) into dual-model architectures (reference model with archetypes). A set of methodologies and tools has been developed to support the definition of archetypes from multiple reference models. - Data transformation. A complete methodology and tools are proposed to deal with the transformation of legacy data into XML documents compliant with the archetype and the underlying reference model. If the reference model is a standard, then the transformation is a standardization process.
    The methodologies and tools allow both the transformation of legacy data and the transformation of data between different EHR standards. - Automatic generation of implementation guides and reference materials from archetypes. A methodology is provided for the automatic generation of a set of reference materials useful for the development and use of EHR systems, including data validators, example instances, implementation guides, human-readable formal rules, sample forms, mind maps, etc. These reference materials can be combined and organized in different ways to suit different types of users (clinical or information-technology staff). In this way, users can include the detailed clinical model in their organization's workflow and cooperate in the model definition. These methodologies and tools make clinical models a key part of the system. Together, the presented methodologies and tools ease the achievement of semantic interoperability by providing means for the semantic description, normalization, and validation of existing and new systems.
    Boscá Tomás, D. (2016). Detailed clinical models and their relation with electronic health records [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/62174
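    The data-transformation scenario above, mapping flat legacy fields into an XML instance whose structure is dictated by the archetype and reference model, can be sketched in a few lines. The tag names, the mapping, and the sample record below are hypothetical stand-ins; the thesis's actual tooling and schemas are not reproduced here.

```python
import xml.etree.ElementTree as ET

def legacy_to_xml(record, mapping, root_tag="observation"):
    """Build an XML instance from a flat legacy record.
    mapping: legacy field name -> slash-separated XML path (hypothetical,
    standing in for paths derived from an archetype)."""
    root = ET.Element(root_tag)
    for legacy_field, xml_path in mapping.items():
        node = root
        for tag in xml_path.split("/"):           # create nested elements as needed
            found = node.find(tag)
            node = found if found is not None else ET.SubElement(node, tag)
        node.text = str(record[legacy_field])
    return root

legacy = {"sys_bp": 120, "dia_bp": 80}            # invented legacy record
mapping = {"sys_bp": "data/systolic/magnitude",
           "dia_bp": "data/diastolic/magnitude"}  # invented archetype paths
xml = legacy_to_xml(legacy, mapping)
print(ET.tostring(xml, encoding="unicode"))
```

    In the real pipeline the target paths would come from the archetype definition rather than a hand-written dictionary, and the resulting instance would be validated against the reference model, which is where the generated data validators mentioned above come in.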

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines, where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may maintain preprocessed structured knowledge derived from textual data as so-called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can then be one or more named entities instead of a ranked list of documents (e.g., “what diseases are associated with EGFR mutations?”). This refined setup is often termed question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play, including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., the cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized because the types of queries and the variations in the texts differ from those of general Web documents. For example, scientific articles are more formal, with longer sentences, while clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals.
    Queries are also different and can range from simple phrases (e.g., “COVID-19 symptoms”) to more complex, implicitly fielded queries (e.g., “chemotherapy regimens for stage IV lung cancer patients with ALK mutations”). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods, and retrieval methodology involves devising these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword-based indexing of documents (the so-called inverted index) and matching query phrases against the document index. This keyword-based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at the phrase/clause/sentence levels), which has led to dimensionality-reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high-level concepts, with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents likely to contain answers (e.g., factoids, lists) to a question in a QA task. The model is essentially a sentence-matching neural network that learns the relevance of a candidate answer sentence to the given question, parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision-medicine use cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from the previous ones is that it pivots from a query-manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts.
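    The bilinear map mentioned for the third effort has a compact form: a question vector q and a candidate-sentence vector s are compared through a learned matrix W, giving score = qᵀ W s. The sketch below uses random stand-ins for the vectors and W; in the dissertation they would come from trained neural encoders and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(dim, dim))          # learned bilinear map (random stand-in)

def bilinear_score(q, s):
    """Relevance of candidate sentence s to question q: q^T W s."""
    return float(q @ W @ s)

question = rng.normal(size=dim)          # encoder output stand-ins
candidates = [rng.normal(size=dim) for _ in range(3)]

# Rerank candidate sentences by descending bilinear score.
ranked = sorted(range(len(candidates)),
                key=lambda i: bilinear_score(question, candidates[i]),
                reverse=True)
```

    Compared with a plain dot product, the matrix W lets the model learn which question dimensions should interact with which answer dimensions, at the cost of dim² extra parameters.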