5,396 research outputs found
Utility-Preserving Anonymization of Textual Documents
Every day, people post a significant amount of data on the Internet, such as tweets, reviews, photos, and videos. Organizations collecting these types of data use them to extract information in order to improve their services or for commercial purposes. Yet, if the collected data contain sensitive personal information, they cannot be shared with third parties or released publicly without consent or adequate protection of the data subjects. Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed.
A great variety of mechanisms have been proposed to anonymize structured databases with numerical and categorical attributes; however, automatically protecting unstructured textual data has received much less attention. In general, textual data anonymization requires first detecting pieces of text that may disclose sensitive information and then masking those pieces via suppression or generalization.
In this work, we leverage several technologies to anonymize textual documents. We first improve state-of-the-art techniques based on sequence labeling. After that, we extend them to make them more aligned with the notion of privacy risk and the privacy requirements. Finally, we propose a complete framework based on word embedding models that captures a broader notion of data protection and provides flexible protection driven by privacy requirements. We also leverage ontologies to preserve the utility of the masked text, that is, its semantics and readability. Extensive experimental results show that our methods outperform the state of the art by providing more robust anonymization while reasonably preserving the utility of the protected outcome.
A Survey on Deep Learning in Medical Image Analysis
Deep learning algorithms, in particular convolutional networks, have rapidly
become a methodology of choice for analyzing medical images. This paper reviews
the major deep learning concepts pertinent to medical image analysis and
summarizes over 300 contributions to the field, most of which appeared in the
last year. We survey the use of deep learning for image classification, object
detection, segmentation, registration, and other tasks and provide concise
overviews of studies per application area. Open challenges and directions for
future research are discussed.
Comment: Revised survey includes expanded discussion section and reworked
introductory section on common deep architectures. Added missed papers from
before Feb 1st 201
End to end approach for i2b2 2012 challenge based on Cross-lingual models
BACKGROUND - We propose a cross-lingual approach to the i2b2 2012 challenge for clinical
records, which focuses on temporal relations in clinical narratives. A corpus of discharge
summaries annotated with temporal information was provided for automatically
extracting: (1) clinically significant events, including both clinical concepts such as
problems, tests, treatments, and clinical departments, and events relevant to the patient’s
clinical timeline, such as admissions and transfers between departments; (2) temporal
expressions, referring to dates, times, durations, or frequencies in the clinical text, whose
values had to be normalized to an ISO specification standard; and (3) temporal relations
among the clinical events and temporal expressions.
GOALS - The objectives of the current work are to outperform the previous state of the
art for the i2b2 2012 challenge and to adapt a cross-lingual model to a specific clinical
domain with few data resources available.
METHODS - The task has been conceived as a pipeline of modules: a token classifier for
events and temporal expressions and a text classifier for relation extraction, each
developed independently of the other. We used the XLM-RoBERTa cross-lingual model.
RESULTS - For event detection, the proposed token classifier obtains a 0.91 Span F1. For
temporal expressions, our token classifier also achieves a 0.91 Span F1. For temporal
relations, we propose a sentence classifier based on sequential taggers that achieves a
0.29 F1 measure.
Using Neural Networks for Relation Extraction from Biomedical Literature
Using different sources of information to support the automated extraction of
relations between biomedical concepts contributes to the development of our
understanding of biological systems. The primary comprehensive source of these
relations is biomedical literature. Several relation extraction approaches have
been proposed to identify relations between concepts in biomedical literature,
notably those using neural network algorithms. The use of multichannel architectures
composed of multiple data representations, as in deep neural networks, is
leading to state-of-the-art results. The right combination of data
representations can eventually lead us to even higher evaluation scores in
relation extraction tasks. Thus, biomedical ontologies play a fundamental role
by providing semantic and ancestry information about an entity. The
incorporation of biomedical ontologies has already been proved to enhance
previous state-of-the-art results.
Comment: Artificial Neural Networks book (Springer) - Chapter 1
A survey on recent advances in named entity recognition
Named Entity Recognition seeks to extract substrings within a text that name
real-world objects and to determine their type (for example, whether they refer
to persons or organizations). In this survey, we first present an overview of
recent popular approaches, but we also look at graph- and transformer-based
methods including Large Language Models (LLMs) that have not had much coverage
in other surveys. Second, we focus on methods designed for datasets with scarce
annotations. Third, we evaluate the performance of the main NER implementations
on a variety of datasets with differing characteristics (as regards their
domain, their size, and their number of classes). We thus provide a deep
comparison of algorithms that are never considered together. Our experiments
shed some light on how the characteristics of datasets affect the behavior of
the methods that we compare.
Comment: 30 page