Stochastic Adversarial Gradient Embedding for Active Domain Adaptation
Unsupervised Domain Adaptation (UDA) aims to bridge the gap between a source
domain, where labelled data are available, and a target domain only represented
with unlabelled data. Although domain-invariant representations have dramatically
improved the adaptability of models, guaranteeing their good transferability
remains a challenging problem. This paper addresses this problem by using
active learning to annotate a small budget of target data. Although this setup,
called Active Domain Adaptation (ADA), deviates from UDA's standard setup, a
wide range of practical applications face this situation. To this
end, we introduce Stochastic Adversarial Gradient Embedding
(SAGE), a framework that makes three contributions to ADA. First, we select
for annotation target samples that are likely to improve the representations'
transferability by measuring the variation, before and after annotation, of the
transferability loss gradient. Second, we increase sampling diversity by
promoting different gradient directions. Third, we introduce a novel training
procedure for actively incorporating target samples when learning invariant
representations. SAGE rests on solid theoretical ground and is validated on
various UDA benchmarks against several baselines. Our empirical investigation
demonstrates that SAGE takes the best of both uncertainty and diversity
sampling and substantially improves the transferability of representations.
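The second contribution, promoting different gradient directions to increase sampling diversity, can be illustrated with a minimal sketch. This is not the SAGE procedure itself, whose details the abstract does not give: it swaps in a plain greedy farthest-first selection under cosine distance. The function names, the `budget` parameter, and the toy gradient vectors are all hypothetical, and the gradients are assumed to be non-zero.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def select_diverse(gradients, budget):
    """Greedy farthest-first selection over gradient directions.

    gradients: list of per-sample gradient vectors (hypothetical inputs).
    Returns the indices of the selected samples, in selection order.
    """
    # Assumption: start from the sample with the largest gradient norm.
    norms = [math.sqrt(sum(c * c for c in g)) for g in gradients]
    selected = [max(range(len(gradients)), key=lambda i: norms[i])]
    while len(selected) < min(budget, len(gradients)):
        best, best_dist = None, -1.0
        for i, g in enumerate(gradients):
            if i in selected:
                continue
            # Distance to the closest direction already selected.
            d = min(cosine_distance(g, gradients[j]) for j in selected)
            if d > best_dist:
                best, best_dist = i, d
        selected.append(best)
    return selected

# Toy gradients: indices 0 and 1 point in nearly the same direction.
grads = [[1.0, 0.0], [0.9, 0.1], [0.0, 2.0], [-1.0, 0.0]]
picked = select_diverse(grads, budget=3)
```

On this toy input the near-duplicate direction at index 1 is left out, which is the behaviour a diversity criterion is meant to produce.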
Automating the anonymisation of textual corpora
A huge amount of new textual data is created every day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without first undergoing proper anonymisation. Automating this task is challenging and often requires labelling in-domain data from scratch, since anonymised annotated corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and make available a new corpus of annotated, anonymised spontaneous dialogues in Spanish, and (ii) to exploit this resource to investigate techniques for automating the identification of sensitive data, in a setting where no annotated data from the target domain are initially available. For the first objective, the ES-Port corpus is presented. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, so the result is readable, natural text, and it contains annotations of eleven anonymisation categories, as well as annotations of some linguistic and extra-linguistic phenomena such as code-switching, laughter, repetitions, mispronunciations, and so on. For the second objective, the compiled corpus is used to investigate automatic sensitive data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier while annotating as little data as possible. To improve this setting, Knowledge Transfer techniques based on another small, previously annotated anonymisation corpus are explored for seed selection and query selection strategies.
Results show that the proposed seed selection methods yield the best seeds on which to initialise the base learner's training, and that combining the source and target classifiers' uncertainties as the query strategy improves the Active Learning process, resulting in steeper learning curves and reaching top classifier performance in fewer iterations.
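The query strategy mentioned above, combining the uncertainties of a source classifier and a target classifier, can be sketched as follows. The abstract does not say how the two uncertainties are combined, so the weighted sum of prediction entropies used here is an assumption. The function names, the `weight` parameter, and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def query_by_combined_uncertainty(source_probs, target_probs, batch_size, weight=0.5):
    """Rank pool items by a weighted sum of two classifiers' entropies.

    source_probs / target_probs: one class distribution per pool item,
    predicted by the source-domain and target-domain classifiers.
    weight: share given to the source classifier (an assumed combination rule).
    Returns the indices of the batch_size most uncertain pool items.
    """
    scores = []
    for idx, (ps, pt) in enumerate(zip(source_probs, target_probs)):
        score = weight * entropy(ps) + (1.0 - weight) * entropy(pt)
        scores.append((score, idx))
    # Highest combined uncertainty first; ties broken by pool order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:batch_size]]

# Toy pool of three items with binary class distributions.
source_probs = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
target_probs = [[0.8, 0.2], [0.5, 0.5], [0.99, 0.01]]
queried = query_by_combined_uncertainty(source_probs, target_probs, batch_size=2)
```

In a pool-based loop, the returned indices would be sent to the annotator, the new labels added to the training set, and the target classifier retrained before the next query round.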
Algorithms for Query-Efficient Active Learning
Recent decades have witnessed great success of machine learning, especially for tasks where large annotated datasets are available for training models. However, in many applications, raw data, such as images, are abundant, but annotations, such as descriptions of images, are scarce. Annotating data requires human effort and can be expensive. Consequently, one of the central problems in machine learning is how to train an accurate model with as few human annotations as possible. Active learning addresses this problem by bringing the annotator to work together with the learner in the learning process. In active learning, a learner can sequentially select examples and ask the annotator for labels, so that it may require fewer annotations if the learning algorithm avoids querying less informative examples. This dissertation focuses on designing provably query-efficient active learning algorithms. The main contributions are as follows. First, we study noise-tolerant active learning in the standard stream-based setting. We propose a computationally efficient algorithm for actively learning homogeneous halfspaces under bounded noise, and prove that it achieves nearly optimal label complexity. Second, we theoretically investigate a novel interactive model where the annotator can not only return noisy labels, but also abstain from labeling. We propose an algorithm that utilizes abstention responses, and analyze its statistical consistency and query complexity under different conditions on the noise and abstention rates. Finally, we study how to utilize auxiliary datasets in active learning. We consider a scenario where the learner has access to a logged observational dataset in which labeled examples are observed conditioned on a selection policy. We propose algorithms that effectively take advantage of both auxiliary datasets and active learning.
We prove that these algorithms are statistically consistent and achieve a lower label requirement than alternative methods, both theoretically and empirically.
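The stream-based setting described above, where the learner sees examples one at a time and decides whether to ask for a label, can be illustrated with a minimal sketch. It is not the dissertation's algorithm and carries no label-complexity guarantee: it is a plain perceptron for a homogeneous halfspace that queries only when the current score falls within a margin. The `margin` threshold, the function names, and the noiseless toy oracle are assumptions.

```python
def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def stream_active_perceptron(stream, oracle, dim, margin=0.5):
    """Process examples one at a time, querying labels only near the boundary.

    stream: iterable of feature vectors.
    oracle: callable returning the label (+1 or -1) of a vector (the annotator).
    Returns the learned weight vector and the number of label queries made.
    """
    w = [0.0] * dim  # homogeneous halfspace: no bias term
    queries = 0
    for x in stream:
        score = dot(w, x)
        if abs(score) > margin:
            continue  # confident prediction: skip the annotation cost
        y = oracle(x)
        queries += 1
        if y * score <= 0:  # mistake or undecided: perceptron update
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, queries

# Toy stream labelled by the sign of the first coordinate (noiseless oracle).
points = [[1.0, 0.2], [-1.0, 0.1], [2.0, -0.3], [-2.0, 0.4], [0.1, 1.0]]
w, n_queries = stream_active_perceptron(points, lambda x: 1 if x[0] > 0 else -1, dim=2)
```

On this toy stream only two of the five examples are sent to the oracle; the other three lie far enough from the current boundary to be skipped.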