5,396 research outputs found

    Utility-Preserving Anonymization of Textual Documents

    Every day, people post a significant amount of data on the Internet, such as tweets, reviews, photos, and videos. Organizations collecting these types of data use them to extract information in order to improve their services or for commercial purposes. Yet, if the collected data contain sensitive personal information, they cannot be shared with third parties or released publicly without consent or adequate protection of the data subjects. Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed. A great variety of mechanisms have been proposed to anonymize structured databases with numerical and categorical attributes; however, automatically protecting unstructured textual data has received much less attention. In general, textual data anonymization requires first detecting the pieces of text that may disclose sensitive information and then masking those pieces via suppression or generalization. In this work, we leverage several technologies to anonymize textual documents. We first improve state-of-the-art techniques based on sequence labeling. We then extend them to align better with the notion of privacy risk and with privacy requirements. Finally, we propose a complete framework based on word embedding models that captures a broader notion of data protection and provides flexible protection driven by privacy requirements. We also leverage ontologies to preserve the utility of the masked text, that is, its semantics and readability. Extensive experimental results show that our methods outperform the state of the art by providing more robust anonymization while reasonably preserving the utility of the protected outcome.

    A Survey on Deep Learning in Medical Image Analysis

    Deep learning algorithms, in particular convolutional networks, have rapidly become a methodology of choice for analyzing medical images. This paper reviews the major deep learning concepts pertinent to medical image analysis and summarizes over 300 contributions to the field, most of which appeared in the last year. We survey the use of deep learning for image classification, object detection, segmentation, registration, and other tasks and provide concise overviews of studies per application area. Open challenges and directions for future research are discussed. Comment: Revised survey includes expanded discussion section and reworked introductory section on common deep architectures. Added missed papers from before Feb 1st 201

    End to end approach for i2b2 2012 challenge based on Cross-lingual models

    BACKGROUND - We propose a cross-lingual approach to the i2b2 2012 challenge for clinical records, which focuses on temporal relations in clinical narratives. A corpus of discharge summaries annotated with temporal information was provided for automatically extracting: (1) clinically significant events, including both clinical concepts such as problems, tests, treatments, and clinical departments, and events relevant to the patient's clinical timeline, such as admissions and transfers between departments; (2) temporal expressions, referring to the dates, times, durations, or frequencies in the clinical text, whose extracted values had to be normalized to an ISO specification standard; and (3) temporal relations among the clinical events and temporal expressions. GOALS - The objectives of this work are to outperform the previous state of the art for the i2b2 2012 challenge and to adapt a cross-lingual model to a specific clinical domain with few data resources available. METHODS - The task is conceived as a pipeline of independently developed modules: a token classifier for events and temporal expressions, and a text classifier for relation extraction. We used the XLM-RoBERTa cross-lingual model. RESULTS - For event detection, the proposed token classifier obtains a Span F1 of 0.91. For temporal expressions, our token classifier achieves a Span F1 of 0.91. For temporal relations, we propose a sentence classifier based on sequential taggers that achieves an F1 of 0.29.

    Using Neural Networks for Relation Extraction from Biomedical Literature

    Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is the biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, notably those using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead to even higher evaluation scores in relation extraction tasks. In this context, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been shown to improve on previous state-of-the-art results. Comment: Artificial Neural Networks book (Springer) - Chapter 1

    A survey on recent advances in named entity recognition

    Named Entity Recognition seeks to extract substrings within a text that name real-world objects and to determine their type (for example, whether they refer to persons or organizations). In this survey, we first present an overview of recent popular approaches, but we also look at graph- and transformer-based methods, including Large Language Models (LLMs), that have not had much coverage in other surveys. Second, we focus on methods designed for datasets with scarce annotations. Third, we evaluate the performance of the main NER implementations on a variety of datasets with differing characteristics (as regards their domain, their size, and their number of classes). We thus provide a deep comparison of algorithms that are never considered together. Our experiments shed some light on how the characteristics of datasets affect the behavior of the methods that we compare. Comment: 30 page