TopRoBERTa: Topology-Aware Authorship Attribution of Deepfake Texts
Recent advances in Large Language Models (LLMs) have enabled the generation
of open-ended, high-quality texts that are non-trivial to distinguish from
human-written texts. We refer to such LLM-generated texts as \emph{deepfake
texts}. There are currently over 11K text generation models in the Hugging Face
model repository. As such, users with malicious intent can easily use these
open-source LLMs to generate harmful texts and misinformation at scale. To
mitigate this problem, a computational method to determine if a given text is a
deepfake text or not is desired, i.e., a Turing Test (TT). In particular, in this
work, we investigate the more general version of the problem, known as
\emph{Authorship Attribution (AA)}, in a multi-class setting--i.e., not only
determining if a given text is a deepfake text or not but also being able to
pinpoint which LLM is the author. We propose \textbf{TopRoBERTa} to improve
existing AA solutions by capturing more linguistic patterns in deepfake texts
by including a Topological Data Analysis (TDA) layer in the RoBERTa model. We
show the benefits of having a TDA layer when dealing with noisy, imbalanced,
and heterogeneous datasets, by extracting TDA features from the reshaped
pooled output of RoBERTa as input. We use RoBERTa to capture contextual
representations (i.e., semantic and syntactic linguistic features), while using
TDA to capture the shape and structure of data (i.e., linguistic structures).
Finally, \textbf{TopRoBERTa} outperforms vanilla RoBERTa on 2 of 3 datasets,
achieving up to a 7\% increase in Macro F1 score.
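The abstract does not say how the TDA features are computed, so the following is only a generic illustration of what "capturing the shape of data" can mean, not the authors' implementation. It reshapes a flat vector into a small point cloud and computes the death times of connected components (0-dimensional persistent homology of a Vietoris-Rips filtration), which equal the minimum-spanning-tree edge lengths. The function names `reshape` and `h0_death_times`, the parameter `dim`, and the toy input are all hypothetical.

```python
import math
from itertools import combinations


def reshape(vector, dim):
    """Split a flat vector into consecutive points with `dim` coordinates each."""
    if len(vector) % dim != 0:
        raise ValueError("vector length must be divisible by dim")
    return [tuple(vector[i:i + dim]) for i in range(0, len(vector), dim)]


def h0_death_times(points):
    """Return H0 death times of a Vietoris-Rips filtration, in increasing order.

    Every point is born at scale 0. A connected component dies when an edge
    first merges it into another component, so the death times are exactly
    the edge lengths of a minimum spanning tree (Kruskal's algorithm).
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )

    deaths = []
    for length, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge two components
            deaths.append(length)    # one component dies at this scale
    return deaths


# Toy stand-in for a pooled output vector (assumption: 6 values, 2-D points).
toy_vector = [0.0, 0.0, 1.0, 0.0, 5.0, 0.0]
features = h0_death_times(reshape(toy_vector, 2))
print(features)
```

In a model such as the one described, a fixed-length summary of features like these would be fed to the classifier alongside the contextual representation; how the real TDA layer does this is not specified here.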
Automating the anonymisation of textual corpora
A huge amount of new textual data is created every day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without undergoing proper anonymisation. Automating this task is challenging and often requires labelling in-domain data from scratch, since anonymised annotated corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and provide a new corpus in Spanish with annotated anonymised spontaneous dialogue data, and (ii) to exploit the newly provided resource to investigate techniques for automating the sensitive data identification task, in a setting where initially no annotated data from the target domain are available. Following these aims, the ES-Port corpus is presented first. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, so the result is a readable, natural text, and it contains annotations of eleven different anonymisation categories, as well as annotations of some linguistic and extra-linguistic phenomena such as code-switching, laughter, repetitions, mispronunciations, and so on. Next, the compiled corpus is used to investigate automatic sensitive data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier while annotating as little data as possible. To improve this setting, Knowledge Transfer techniques from another small, already annotated anonymisation corpus are explored for seed selection and query selection strategies.
Results show that the proposed seed selection methods yield the best seeds on which to initialise the base learner's training, and that combining the source and target classifiers' uncertainties as a query strategy improves the Active Learning process, resulting in steeper learning curves and reaching top classifier performance in fewer iterations.
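As a rough illustration of the query strategy mentioned above, the sketch below ranks unlabelled pool items by a combination of two classifiers' predictive entropies and picks the most uncertain ones for annotation. The abstract does not state how the source and target uncertainties are combined; the weighted average, the function names `entropy` and `select_queries`, the `weight` parameter, and the toy probabilities are all assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_queries(target_probs, source_probs, k, weight=0.5):
    """Return indices of the k pool items with the highest combined uncertainty.

    target_probs / source_probs: per-item class distributions predicted by the
    target-domain and source-domain classifiers (same item order).
    weight: share given to the target classifier's entropy (assumed scheme).
    """
    scores = [
        weight * entropy(t) + (1 - weight) * entropy(s)
        for t, s in zip(target_probs, source_probs)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of three items with two classes (sensitive / not sensitive).
target = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
source = [[0.8, 0.2], [0.5, 0.5], [0.99, 0.01]]
to_annotate = select_queries(target, source, k=2)
print(to_annotate)
```

In a pool-based loop, the selected items would be labelled by an annotator, added to the training set, and the target classifier retrained before the next round of selection.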
Sustainable Agriculture and Rural Development in Terms of The Republic of Serbia Strategic Goals Realization within The Danube Region (preservation of rural values)
The International Scientific Meeting „Sustainable Agriculture and Rural Development in Terms of The Republic of Serbia Strategic Goals Realization within The Danube Region“ (preservation of rural values), held on 6-8 December 2012 on Mount Tara (Republic of Serbia), provides, through a large number of presented papers, an overview of the results of scientific research on the integrated and interdisciplinary project „Sustainable Agriculture and Rural Development in Terms of the Republic of Serbia Strategic Goals Realization within the Danube Region“.
The meeting gathered a large number of scientists and expert researchers from a number of countries. Besides authors from the Republic of Serbia, the papers also represent authors from Romania, Bulgaria, the Russian Federation, Bosnia and Herzegovina, Hungary, the Netherlands, Macedonia, and Poland.
The Proceedings contain 91 papers that were positively evaluated by the reviewers and presented at the Scientific Meeting. The publisher is the Institute of Agricultural Economics, Belgrade, together with 38 eminent scientific and educational institutions from Serbia and abroad. Three (3) papers that stand out for their contributions to the Scientific Meeting were presented in the Plenary section. The remaining papers are systematized in three (3) sections.
The presented and published papers are systematized in three (3) thematic sections:
I SUSTAINABLE DEVELOPMENT AS A MODERN DEVELOPMENTAL APPROACH IN PRESERVATION OF AGRICULTURE AND RURAL VALUES (41 papers are presented in this section);
II STRATEGIC PLANNING AND INSTITUTIONAL-POLITICAL DIMENSION OF AGRARIAN AND RURAL DEVELOPMENT (13 papers are presented in this section);
III AGRIBUSINESS OF RURAL AREAS, DIVERSIFICATION AND COMPARATIVE ADVANTAGES OF RURAL ECONOMY (34 papers are presented in this section).