11 research outputs found

    Contributions to information extraction for Spanish written biomedical text

    285 p.
    Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in, for example, improving healthcare experiences, supporting trainee education, or enabling biomedical research. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability across two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data or external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems and does not deviate considerably from approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field.
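
    As a rough illustration of how dictionary-driven biomedical term identification can work, the sketch below greedily matches lowercased n-grams against a toy term-to-CUI dictionary. The TERM_TO_CUI entries, whitespace tokenisation and greedy longest-match strategy are illustrative assumptions, not a description of UMLSmapper's actual pipeline.

```python
# Minimal sketch of dictionary-based term identification in the spirit of a
# UMLS-style lookup. The tiny term->CUI dictionary below is illustrative;
# a real system would load terms from the UMLS Metathesaurus.
TERM_TO_CUI = {
    "diabetes mellitus": "C0011849",
    "insulina": "C0021641",
    "hipertensión": "C0020538",
}

def identify_terms(text, max_ngram=3):
    """Greedy longest-match lookup of lowercased n-grams against the dictionary."""
    tokens = text.lower().split()
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in TERM_TO_CUI:
                spans.append((candidate, TERM_TO_CUI[candidate]))
                i += n
                break
        else:
            i += 1  # no match starting here; advance one token
    return spans

print(identify_terms("paciente con diabetes mellitus tratado con insulina"))
# [('diabetes mellitus', 'C0011849'), ('insulina', 'C0021641')]
```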

    Automatic Discovery of Heterogeneous Machine Learning Pipelines: An Application to Natural Language Processing

    This paper presents AutoGOAL, a system for automatic machine learning (AutoML) that uses heterogeneous techniques. In contrast with existing AutoML approaches, our contribution can automatically build machine learning pipelines that combine techniques and algorithms from different frameworks, including shallow classifiers, natural language processing tools, and neural networks. We define the heterogeneous AutoML optimization problem as the search for the best sequence of algorithms that transforms specific input data into the desired output. This provides a novel theoretical and practical approach to AutoML. Our proposal is experimentally evaluated on diverse machine learning problems and compared with alternative approaches, showing that it is competitive with other AutoML alternatives on standard benchmarks. Furthermore, it can be applied to novel scenarios, such as several NLP tasks, where existing alternatives cannot be directly deployed. The system is freely available and includes built-in compatibility with a large number of popular machine learning frameworks, which makes our approach useful for solving practical problems with relative ease.
    This research has been supported by a Carolina Foundation grant in agreement with the University of Alicante and the University of Havana. Moreover, it has also been partially funded by both aforementioned universities, the Generalitat Valenciana (Conselleria d'Educació, Investigació, Cultura i Esport) and the Spanish Government through the projects LIVING-LANG (RTI2018-094653-B-C22) and SIIA (PROMETEO/2018/089, PROMETEU/2018/089).
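
    The type-chaining constraint at the heart of this formulation can be illustrated with a toy search: find a sequence of algorithms whose input/output annotations connect the given input type to the desired output type. The ALGORITHMS registry and the breadth-first strategy below are assumptions made for this sketch; AutoGOAL itself uses a more sophisticated optimisation strategy and a different API.

```python
# Sketch of type-driven pipeline search: find a sequence of algorithms whose
# input/output types chain the given input type to the desired output type.
from collections import deque

# Hypothetical registry of (name, input_type, output_type) entries.
ALGORITHMS = [
    ("Tokenizer", "Document", "Tokens"),
    ("TfIdfVectorizer", "Tokens", "Matrix"),
    ("WordEmbedder", "Tokens", "Matrix"),
    ("LogisticRegression", "Matrix", "Label"),
    ("CRFTagger", "Tokens", "TagSequence"),
]

def find_pipeline(input_type, output_type):
    """Breadth-first search over algorithm chains (shortest pipeline first)."""
    queue = deque([(input_type, [])])
    seen = {input_type}
    while queue:
        current, path = queue.popleft()
        if current == output_type:
            return path
        for name, t_in, t_out in ALGORITHMS:
            if t_in == current and t_out not in seen:
                seen.add(t_out)
                queue.append((t_out, path + [name]))
    return None  # no chain of algorithms connects the two types

print(find_pipeline("Document", "Label"))
# ['Tokenizer', 'TfIdfVectorizer', 'LogisticRegression']
```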

    Utility-Preserving Anonymization of Textual Documents

    Every day, people post a significant amount of data on the Internet, such as tweets, reviews, photos, and videos. Organizations collecting these types of data use them to extract information in order to improve their services or for commercial purposes. Yet, if the collected data contain sensitive personal information, they cannot be shared with third parties or released publicly without the consent or adequate protection of the data subjects. Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed. A great variety of mechanisms have been proposed to anonymize structured databases with numerical and categorical attributes; however, automatically protecting unstructured textual data has received much less attention. In general, textual data anonymization requires, first, detecting pieces of text that may disclose sensitive information and, then, masking those pieces via suppression or generalization. In this work, we leverage several technologies to anonymize textual documents. We first improve state-of-the-art techniques based on sequence labeling. After that, we extend them to align them better with the notion of privacy risk and with privacy requirements. Finally, we propose a complete framework based on word embedding models that captures a broader notion of data protection and provides flexible protection driven by privacy requirements. We also leverage ontologies to preserve the utility of the masked text, that is, its semantics and readability. Extensive experimental results show that our methods outperform the state of the art by providing more robust anonymization while reasonably preserving the utility of the protected outcome.
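
    The detect-then-mask recipe described above can be sketched in a few lines: given character spans flagged as sensitive by a detector (e.g., a sequence labeler), each span is either suppressed or generalized to a broader concept. The example document, spans, placeholder and mini-ontology below are all hypothetical.

```python
# Sketch of the two classic masking operations for text anonymization:
# suppression (remove the span) and generalization (replace it with a
# broader concept drawn from an ontology).
GENERALIZATIONS = {  # illustrative entity -> broader-concept mapping
    "London": "a city",
    "St. Mary's Hospital": "a hospital",
}

def mask(text, spans, mode="generalize"):
    """spans: list of (start, end) character offsets flagged as sensitive."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(text[last:start])
        surface = text[start:end]
        if mode == "generalize" and surface in GENERALIZATIONS:
            out.append(GENERALIZATIONS[surface])
        else:
            out.append("[REDACTED]")  # suppression fallback
        last = end
    out.append(text[last:])
    return "".join(out)

doc = "The patient was treated at St. Mary's Hospital in London."
spans = [(27, 46), (50, 56)]  # as detected by an upstream sequence labeler
print(mask(doc, spans))
# The patient was treated at a hospital in a city.
```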

    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods for various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
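
    A meta-embedding in this spirit can be sketched as follows: project each source embedding space into a common dimension and combine the projections. The projection-and-average formulation, the MetaEmbedding class and its parameters are assumptions made for this illustration, not the exact architecture proposed in the thesis.

```python
# Sketch of a simple meta-embedding: project embeddings from several source
# spaces into one joint dimension and average the projections.
import torch
import torch.nn as nn

class MetaEmbedding(nn.Module):
    def __init__(self, source_dims, joint_dim=256):
        super().__init__()
        # One linear projection per source embedding space.
        self.projections = nn.ModuleList(
            nn.Linear(d, joint_dim) for d in source_dims
        )

    def forward(self, embeddings):
        """embeddings: one tensor per source, each of shape (batch, seq, dim_i)."""
        projected = [proj(e) for proj, e in zip(self.projections, embeddings)]
        return torch.stack(projected).mean(dim=0)  # (batch, seq, joint_dim)

# Example: combine a 300-d word2vec-style and a 768-d BERT-style embedding.
meta = MetaEmbedding([300, 768])
joint = meta([torch.randn(2, 10, 300), torch.randn(2, 10, 768)])
print(joint.shape)  # torch.Size([2, 10, 256])
```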

    A computational ecosystem to support eHealth Knowledge Discovery technologies in Spanish

    The massive amount of biomedical information published online requires the development of automatic knowledge discovery technologies to make effective use of this available content. To foster and support this, the research community creates linguistic resources, such as annotated corpora, and designs shared evaluation campaigns and academic competitive challenges. This work describes an ecosystem that facilitates research and development in knowledge discovery in the biomedical domain, specifically in the Spanish language. To this end, several resources are developed and shared with the research community, including a novel semantic annotation model, an annotated corpus of 1045 sentences, and computational resources to build and evaluate automatic knowledge discovery techniques. Furthermore, a research task is defined with objective evaluation criteria, and an online evaluation environment is set up and maintained, enabling researchers interested in this task to obtain immediate feedback and compare their results with the state of the art. As a case study, we analyze the results of a competitive challenge based on these resources and provide guidelines for future research. The constructed ecosystem provides an effective learning and evaluation environment to encourage research in knowledge discovery in Spanish biomedical documents.
    This research has been partially supported by the University of Alicante and University of Havana, the Generalitat Valenciana (Conselleria d'Educació, Investigació, Cultura i Esport) and the Spanish Government through the projects SIIA (PROMETEO/2018/089, PROMETEU/2018/089) and LIVING-LANG (RTI2018-094653-B-C22).
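
    The objective evaluation criteria behind such a shared task typically reduce to entity-level precision, recall and F1 against the gold-standard annotations, as in the minimal sketch below. The annotation tuples are hypothetical, and real campaigns usually add partial-match and relation-level variants.

```python
# Sketch of entity-level evaluation against gold-standard annotations.
def evaluate(gold, predicted):
    """gold, predicted: sets of (start, end, label) annotation tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact matches on span and label
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 8, "Concept"), (12, 20, "Action")}
pred = {(0, 8, "Concept"), (25, 30, "Concept")}
print(evaluate(gold, pred))  # (0.5, 0.5, 0.5)
```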

    Safeguarding Privacy Through Deep Learning Techniques

    Over the last few years, there has been a growing need to meet minimum security and privacy requirements. Both public and private companies have had to comply with increasingly stringent standards, such as the ISO 27000 family of standards, or the various laws governing the management of personal data. The huge amount of data to be managed has required a huge effort from the employees who, in the absence of automatic techniques, have had to work tirelessly to achieve the certification objectives. Unfortunately, because of the sensitive information contained in the documentation related to these problems, it is difficult, if not impossible, to obtain material for research and study purposes on which to test new ideas and techniques aimed at automating these processes, perhaps exploiting recent developments in the scientific community in the fields of ontologies and artificial intelligence for data management. To bypass this problem, we decided to examine data related to the medical world, which, especially for important reasons related to the health of individuals, have gradually become more freely accessible over time, without affecting the generality of the proposed methods, which can be reapplied to the most diverse fields in which there is a need to manage privacy-sensitive information.

    On the Use of Parsing for Named Entity Recognition

    [Abstract] Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and, if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task.
    Xunta de Galicia; ED431C 2020/11
    Xunta de Galicia; ED431G 2019/01
    This work has been funded by MINECO, AEI and FEDER of UE through the ANSWER-ASAP project (TIN2017-85160-C2-1-R); and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as Research Center of the Galician University System, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER) with 80%, the Galicia ERDF 2014-20 Operational Programme, and the remaining 20% from the Secretaría Xeral de Universidades (Ref. ED431G 2019/01). Carlos Gómez-Rodríguez has also received funding from the European Research Council (ERC), under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, Grant No. 714150).
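
    One way to cast parsing as sequence labeling, in the spirit of the proposal above, is to encode each token's dependency head as a per-token tag that a standard sequence labeler can predict. The relative-offset encoding below is one simple variant of such linearizations, shown only as an illustration; the paper's actual encoding may differ.

```python
# Sketch of encoding a dependency tree as per-token tags, so parse structure
# can be predicted (and consumed by NER) with ordinary sequence labeling.
def encode(heads, relations):
    """heads[i]: 1-based index of token i's head (0 = root); relations[i]: label."""
    tags = []
    for i, (head, rel) in enumerate(zip(heads, relations), start=1):
        if head == 0:
            tags.append(f"ROOT_{rel}")
        else:
            tags.append(f"{head - i:+d}_{rel}")  # signed offset to the head
    return tags

# "Mary visited London": both content tokens attach to 'visited' (token 2).
print(encode([2, 0, 2], ["nsubj", "root", "obj"]))
# ['+1_nsubj', 'ROOT_root', '-1_obj']
```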

    Using machine learning for automated de-identification and clinical coding of free text data in electronic medical records

    The widespread adoption of Electronic Medical Records (EMRs) in hospitals continues to increase the amount of patient data that are digitally stored. Although the primary use of the EMR is to support patient care by making all relevant information accessible, governments and health organisations are looking for ways to unleash the potential of these data for secondary purposes, including clinical research, disease surveillance and automation of healthcare processes and workflows. EMRs include large quantities of free text documents that contain valuable information. The greatest challenges in using the free text data in EMRs include the removal of personally identifiable information and the extraction of relevant information for specific tasks such as clinical coding. Machine learning-based automated approaches can potentially address these challenges. This thesis aims to explore and improve the performance of machine learning models for automated de-identification and clinical coding of free text data in EMRs, as captured in hospital discharge summaries, and to facilitate the application of these approaches in real-world use cases. It does so by 1) implementing an end-to-end de-identification framework using an ensemble of deep learning models; 2) developing a web-based system for de-identification of free text (DEFT) with an interactive learning loop; 3) proposing and implementing a hierarchical label-wise attention transformer model (HiLAT) for explainable International Classification of Diseases (ICD) coding; and 4) investigating the use of extreme multi-label long text transformer-based models for automated ICD coding. The key findings include: 1) An end-to-end framework using an ensemble of deep learning base models achieved excellent performance on the de-identification task. 2) A new web-based de-identification software system (DEFT) can be readily and easily adopted by data custodians and researchers to perform de-identification of free text in EMRs. 3) A novel domain-specific transformer-based model (HiLAT) achieved state-of-the-art (SOTA) results for predicting ICD codes on a Medical Information Mart for Intensive Care (MIMIC-III) dataset comprising the discharge summaries (n=12,808) that are coded with at least one of the 50 most frequent diagnosis and procedure codes. In addition, the label-wise attention scores for the tokens in the discharge summary presented a potential explainability tool for checking the face validity of ICD code predictions. 4) An optimised transformer-based model, PLM-ICD, achieved the latest SOTA results for ICD coding on all the discharge summaries of the MIMIC-III dataset (n=59,652). The segmentation method, which splits the long text consecutively into multiple small chunks, addressed the problem of applying transformer-based models to long text datasets. However, using transformer-based models on extremely large label sets needs further research. These findings demonstrate that the de-identification and clinical coding tasks can benefit from the application of machine learning approaches, present practical tools for implementing these approaches, and highlight priorities for further research.
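
    The segmentation method mentioned in the fourth finding can be sketched simply: split the tokenised document into consecutive fixed-size chunks so that a length-limited transformer can encode each chunk separately. The chunk size and the stand-in token ids below are illustrative.

```python
# Sketch of consecutive fixed-size chunking for long-document transformers.
def chunk(token_ids, chunk_size=512):
    """Split a token id sequence into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

ids = list(range(1300))          # stand-in for a tokenised discharge summary
chunks = chunk(ids)
print([len(c) for c in chunks])  # [512, 512, 276]
```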