2 research outputs found

    Automating the anonymisation of textual corpora

    Vast amounts of new textual data are created every day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without proper anonymisation. Automating this task is challenging and often requires labelling in-domain data from scratch, since annotated anonymised corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and provide a new Spanish corpus of annotated, anonymised spontaneous dialogue data, and (ii) to exploit the new resource to investigate techniques for automating the sensitive-data identification task in a setting where no annotated data from the target domain are initially available. Following these aims, the ES-Port corpus is first presented: a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, so the result is readable, natural text. It contains annotations for eleven anonymisation categories, as well as annotations of linguistic and extra-linguistic phenomena such as code-switching, laughter, repetitions, and mispronunciations.

    Next, the compiled corpus is used to investigate automatic sensitive-data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier while annotating as little data as possible. To improve this setting, Knowledge Transfer techniques from another small annotated anonymisation corpus are explored for seed selection and query selection strategies. Results show that the proposed seed selection methods obtain the best seeds on which to initialise the base learner's training, and that combining the source and target classifiers' uncertainties as a query strategy improves the Active Learning process, yielding steeper learning curves and reaching top classifier performance in fewer iterations.
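As a rough illustration of the substitution technique mentioned above, the sketch below replaces detected sensitive spans with same-category surrogates so the output stays readable. It is a minimal sketch only: the detector interface, the category names, and the surrogate lists are hypothetical stand-ins, not the thesis's eleven categories or its actual pipeline.

```python
import random

# Hypothetical surrogate pools per anonymisation category; the thesis's
# eleven categories and its actual detector are not reproduced here.
SURROGATES = {
    "PERSON": ["Ana García", "Luis Pérez"],
    "PHONE": ["600123456", "911234567"],
}

def substitute(text, spans):
    """Replace detected sensitive spans with same-category surrogates.

    `spans` is a list of (start, end, category) tuples, e.g. produced by
    an NER model; replacements are applied right-to-left so that earlier
    character offsets stay valid.
    """
    for start, end, category in sorted(spans, key=lambda s: s[0], reverse=True):
        surrogate = random.choice(SURROGATES[category])
        text = text[:start] + surrogate + text[end:]
    return text

# Toy usage: the spans would normally come from a trained detector.
print(substitute("Llama a Juan al 655987321.",
                 [(8, 12, "PERSON"), (16, 25, "PHONE")]))
```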
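The combined-uncertainty query strategy can likewise be sketched as a generic pool-based Active Learning loop. The scikit-learn-style interfaces, the entropy-based uncertainty measure, and the mixing weight `alpha` are all assumptions for illustration; the thesis's actual learners and token-level setup are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy(probs):
    """Prediction entropy per instance; higher means more uncertain."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def active_learning(X_seed, y_seed, X_pool, y_pool, source_clf,
                    n_iterations=10, batch_size=20, alpha=0.5):
    """Pool-based AL whose query score mixes the uncertainty of a
    source-domain classifier (trained beforehand on the small annotated
    corpus) with that of the target classifier; `alpha` is an assumed
    hyperparameter, not a value from the thesis."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    target_clf = LogisticRegression(max_iter=1000)
    for _ in range(n_iterations):
        target_clf.fit(X_train, y_train)
        score = (alpha * entropy(target_clf.predict_proba(X_pool))
                 + (1 - alpha) * entropy(source_clf.predict_proba(X_pool)))
        picked = np.argsort(score)[-batch_size:]  # most uncertain instances
        # In practice these instances would go to a human annotator; here
        # we reveal the held-back pool labels to simulate annotation.
        X_train = np.vstack([X_train, X_pool[picked]])
        y_train = np.concatenate([y_train, y_pool[picked]])
        X_pool = np.delete(X_pool, picked, axis=0)
        y_pool = np.delete(y_pool, picked)
    return target_clf
```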

    Synthetic Data Sharing and Estimation of Viable Dynamic Treatment Regimes with Observational Data

    Significant public demand exists for rapid data-driven scientific investigation using observational data, especially in personalized healthcare. This dissertation addresses three complementary challenges of analyzing complex observational data in biomedical research.

    The ethical challenge reflects regulatory policies and social norms regarding data privacy, which tend to emphasize data security at the expense of effective data sharing, resulting in fragmentation and scarcity of available research data. In Chapter 2, we propose the DataSifter approach, which mediates this challenge by facilitating the generation of realistic synthetic data from sensitive datasets containing static and time-varying variables. The DataSifter method relies on robust imputation methods, including missForest and an iterative imputation technique for time-varying variables using the Generalized Linear Mixed Model (GLMM) and the Random Effects-Expectation Maximization tree (RE-EM tree). Applications demonstrate that, under a moderate level of obfuscation, the DataSifter guarantees sufficient per-subject perturbation of time-invariant data and preserves the joint distribution and the energy of the entire data archive, ensuring high utility and analytical value of the time-varying information. This promotes accelerated innovation by enabling secure sharing between data governors and researchers.

    Once sensitive data can be securely shared, effective analytical tools are needed to provide viable, individualized, data-driven solutions. Observational data are an important source for estimating dynamic treatment regimes (DTRs) that guide personalized treatment decisions. The second challenge concerns the viability of optimal DTR estimation, which may be affected by observed treatment combinations that are not applicable to future patients for clinical or economic reasons. In Chapter 3, we develop restricted Tree-based Reinforcement Learning to accommodate restrictions on feasible treatment combinations in observational studies by truncating possible treatment options based on patient history in a multi-stage, multi-treatment setting. The proposed method provides optimal treatment recommendations over viable treatment options only, and utilizes all valid observations in the dataset to avoid selection bias and improve efficiency.

    In addition to structured data, unstructured data such as free text and voice notes have rapidly become an essential component of many biomedical studies based on clinical and health data, including electronic health records (EHRs), providing extra patient information. The last two chapters of this dissertation (Chapters 4 and 5) extend the methods developed in the previous two projects by using novel natural language processing (NLP) techniques to address the third challenge: handling unstructured data elements. In Chapter 4, we construct a text-data anonymization tool, DataSifterText, which generates synthetic free-text data to protect sensitive unstructured data such as personal health information. In Chapter 5, we propose to enhance the precision of optimal DTR estimation by acquiring additional information contained in clinical notes with information extraction (IE) techniques.

    Simulation studies and an application to blood pressure management in intensive care units demonstrate that the IE techniques can provide extra patient information and more accurate counterfactual outcome modeling, owing to the potentially enhanced sample size and a wider pool of candidate tailoring variables for optimal DTR estimation. The statistical methods presented in this thesis provide theoretical and practical solutions for privacy-aware, utility-preserving, large-scale data sharing and for clinically meaningful optimal DTR estimation. The general theoretical formulation of the methods leads to tools and direct applications that are expected to extend beyond the biomedical and health analytics domains.

    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/166113/1/zhounina_1.pd
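DataSifter's core mechanism, masking a controlled fraction of cells and re-imputing them from the rest of the data so that individual records are perturbed while joint structure is preserved, can be illustrated in simplified form. The published method uses missForest and GLMM/RE-EM tree imputation for time-varying variables; the sketch below substitutes scikit-learn's IterativeImputer on numeric, time-invariant data, so it is an assumed analogue rather than the actual algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def sift(df, obfuscation_level=0.2, random_state=0):
    """Perturb a numeric data frame by masking a fraction of cells and
    re-imputing them from the remaining data, so per-subject values
    change while cross-variable structure is largely preserved."""
    rng = np.random.default_rng(random_state)
    values = df.to_numpy(dtype=float)
    mask = rng.random(values.shape) < obfuscation_level  # cells to obfuscate
    masked = values.copy()
    masked[mask] = np.nan
    imputer = IterativeImputer(random_state=random_state)
    sifted = imputer.fit_transform(masked)
    return pd.DataFrame(sifted, columns=df.columns, index=df.index)
```

Raising `obfuscation_level` trades per-record privacy against fidelity of the released archive, which mirrors the "moderate level of obfuscation" trade-off described in the abstract.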
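The restriction idea behind Chapter 3, ruling infeasible treatment options out before optimising each backward-induction stage, can be shown with a toy Q-learning pass. Tree-based Reinforcement Learning itself grows purity-driven trees; the random-forest Q-model and the `feasible` restriction rule below are simplified, hypothetical stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def restricted_q_stage(X, treatments, outcome, feasible, n_treatments):
    """One backward-induction stage: fit Q(history, treatment) and pick,
    for each patient, the best treatment among feasible options only.

    `feasible(x, a)` is a hypothetical rule returning whether treatment
    `a` is viable given patient history `x` (clinical or economic)."""
    features = np.column_stack([X, treatments])
    q_model = RandomForestRegressor(n_estimators=200, random_state=0)
    q_model.fit(features, outcome)
    best_value = np.full(len(X), -np.inf)
    best_action = np.zeros(len(X), dtype=int)
    for a in range(n_treatments):
        candidate = np.column_stack([X, np.full(len(X), a)])
        q_hat = q_model.predict(candidate)
        allowed = np.array([feasible(x, a) for x in X])
        q_hat[~allowed] = -np.inf  # infeasible options are never recommended
        update = q_hat > best_value
        best_value[update] = q_hat[update]
        best_action[update] = a
    return best_action, best_value  # best_value feeds the previous stage
```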