1,311 research outputs found

    Automating the anonymisation of textual corpora

    Get PDF
    [EU] Gaur egun, testu berriak etengabe sortzen doaz sare sozialetako mezu, osasun-txosten, dokumentu o zial eta halakoen ondorioz. Hala ere, testuok informazio pertsonala baldin badute, ezin dira ikerkuntzarako edota beste helburutarako baliatu, baldin eta aldez aurretik ez badira anonimizatzen. Anonimizatze hori automatikoki egitea erronka handia da eta askotan hutsetik anotatutako domeinukako datuak behar dira, ez baita arrunta helburutzat ditugun testuinguruetarako anotatutako corpusak izatea. Hala, tesi honek bi helburu ditu: (i) Gaztelaniazko elkarrizketa espontaneoz osatutako corpus anonimizatu berri bat konpilatu eta eskura jartzea, eta (ii) sortutako baliabide hau ustiatzea informazio sentiberaren identi kazio-teknikak aztertzeko, helburu gisa dugun domeinuan testu etiketaturik izan gabe. Hala, lehenengo helburuari lotuta, ES-Port izeneko corpusa sortu dugu. Telekomunikazio-ekoizle batek ahoz laguntza teknikoa ematen duenean sortu diren 1170 elkarrizketa espontaneoek osatzen dute corpusa. Ordezkatze-tekniken bidez anonimizatu da, eta ondorioz emaitza testu irakurgarri eta naturala izan da. Hamaika anonimizazio-kategoria landu dira, eta baita hizkuntzakoak eta hizkuntzatik kanpokoak diren beste zenbait anonimizazio-fenomeno ere, hala nola, kode-aldaketa, barrea, errepikapena, ahoskatze okerrak, eta abar. Bigarren helburuari lotuta, berriz, anonimizatu beharreko informazio sentibera identi katzeko, gordailuan oinarritutako Ikasketa Aktiboa erabili da, honek helburutzat baitu ahalik eta testu anotatu gutxienarekin sailkatzaile ahalik eta onena lortzea. Horretaz gain, emaitzak hobetzeko, eta abiapuntuko hautaketarako eta galderen hautaketarako estrategiak aztertzeko, Ezagutza Transferentzian oinarritutako teknikak ustiatu dira, aldez aurretik anotatuta zegoen corpus txiki bat oinarri hartuta. Emaitzek adierazi dute, lan honetan aukeratutako metodoak egokienak izan direla abiapuntuko hautaketa egiteko eta kontsulta-estrategia gisa iturri eta helburu sailkapenen zalantzak konbinatzeak Ikasketa Aktiboa hobetzen duela, ikaskuntza-kurba bizkorragoak eta sailkapen-errendimendu handiagoak lortuz iterazio gutxiagotan.[EN] A huge amount of new textual data are created day by day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without undergoing proper anonymisation. Automating such task is challenging and often requires labelling in-domain data from scratch since anonymised annotated corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and provide a new corpus in Spanish with annotated anonymised spontaneous dialogue data, and (ii) to exploit the newly provided resource to investigate techniques for automating the sensitive data identification task, in a setting where initially no annotated data from the target domain are available. Following such aims, first, the ES-Port corpus is presented. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, which implies the result is a readable natural text, and it contains annotations of eleven different anonymisation categories, as well as some linguistic and extra-linguistic phenomena annotations like code-switching, laughter, repetitions, mispronunciations, and so on. Next, the compiled corpus is used to investigate automatic sensitive data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier having to annotate as little data as possible. In order to improve such setting, Knowledge Transfer techniques from another small available anonymisation annotated corpus are explored for seed selection and query selection strategies. Results show that using the proposed seed selection methods obtain the best seeds on which to initialise the base learner's training and that combining source and target classifiers' uncertainties as query strategy improves the Active Learning process, deriving in steeper learning curves and reaching top classifier performance in fewer iterations

    Moving data into and out of an institutional repository: Off the map and into the territory

    Get PDF
    Given the recent proliferation of institutional repositories, a key strategic question is how multiple institutions - repositories, archives, universities and others—can best work together to manage and preserve research data. In 2007, Green and Gutmann proposed how partnerships among social science researchers, institutional repositories and domain repositories should best work. This paper uses the Timescapes Archive—a new collection of qualitative longitudinal data— to examine the challenges of working across institutions in order to move data into and out of institutional repositories. The Timescapes Archive both tests and extends their framework by focusing on the specific case of qualitative longitudinal research and by highlighting researchers' roles across all phases of data preservation and sharing. Topics of metadata, ethical data sharing, and preservation are discussed in detail. What emerged from the work to date is the extremely complex nature of the coordination required among the agents; getting the timing right is both critical and difficult. Coordination among three agents is likely to be challenging under any circumstances and becomes more so when the trajectories of different life cycles, for research projects and for data sharing, are considered. Timescapes exposed some structural tensions that, although they can not be removed or eliminated, can be effectively managed

    Automating the anonymisation of textual corpora

    Get PDF
    [EU] Gaur egun, testu berriak etengabe sortzen doaz sare sozialetako mezu, osasun-txosten, dokumentu o zial eta halakoen ondorioz. Hala ere, testuok informazio pertsonala baldin badute, ezin dira ikerkuntzarako edota beste helburutarako baliatu, baldin eta aldez aurretik ez badira anonimizatzen. Anonimizatze hori automatikoki egitea erronka handia da eta askotan hutsetik anotatutako domeinukako datuak behar dira, ez baita arrunta helburutzat ditugun testuinguruetarako anotatutako corpusak izatea. Hala, tesi honek bi helburu ditu: (i) Gaztelaniazko elkarrizketa espontaneoz osatutako corpus anonimizatu berri bat konpilatu eta eskura jartzea, eta (ii) sortutako baliabide hau ustiatzea informazio sentiberaren identi kazio-teknikak aztertzeko, helburu gisa dugun domeinuan testu etiketaturik izan gabe. Hala, lehenengo helburuari lotuta, ES-Port izeneko corpusa sortu dugu. Telekomunikazio-ekoizle batek ahoz laguntza teknikoa ematen duenean sortu diren 1170 elkarrizketa espontaneoek osatzen dute corpusa. Ordezkatze-tekniken bidez anonimizatu da, eta ondorioz emaitza testu irakurgarri eta naturala izan da. Hamaika anonimizazio-kategoria landu dira, eta baita hizkuntzakoak eta hizkuntzatik kanpokoak diren beste zenbait anonimizazio-fenomeno ere, hala nola, kode-aldaketa, barrea, errepikapena, ahoskatze okerrak, eta abar. Bigarren helburuari lotuta, berriz, anonimizatu beharreko informazio sentibera identi katzeko, gordailuan oinarritutako Ikasketa Aktiboa erabili da, honek helburutzat baitu ahalik eta testu anotatu gutxienarekin sailkatzaile ahalik eta onena lortzea. Horretaz gain, emaitzak hobetzeko, eta abiapuntuko hautaketarako eta galderen hautaketarako estrategiak aztertzeko, Ezagutza Transferentzian oinarritutako teknikak ustiatu dira, aldez aurretik anotatuta zegoen corpus txiki bat oinarri hartuta. Emaitzek adierazi dute, lan honetan aukeratutako metodoak egokienak izan direla abiapuntuko hautaketa egiteko eta kontsulta-estrategia gisa iturri eta helburu sailkapenen zalantzak konbinatzeak Ikasketa Aktiboa hobetzen duela, ikaskuntza-kurba bizkorragoak eta sailkapen-errendimendu handiagoak lortuz iterazio gutxiagotan.[EN] A huge amount of new textual data are created day by day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without undergoing proper anonymisation. Automating such task is challenging and often requires labelling in-domain data from scratch since anonymised annotated corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and provide a new corpus in Spanish with annotated anonymised spontaneous dialogue data, and (ii) to exploit the newly provided resource to investigate techniques for automating the sensitive data identification task, in a setting where initially no annotated data from the target domain are available. Following such aims, first, the ES-Port corpus is presented. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, which implies the result is a readable natural text, and it contains annotations of eleven different anonymisation categories, as well as some linguistic and extra-linguistic phenomena annotations like code-switching, laughter, repetitions, mispronunciations, and so on. Next, the compiled corpus is used to investigate automatic sensitive data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier having to annotate as little data as possible. In order to improve such setting, Knowledge Transfer techniques from another small available anonymisation annotated corpus are explored for seed selection and query selection strategies. Results show that using the proposed seed selection methods obtain the best seeds on which to initialise the base learner's training and that combining source and target classifiers' uncertainties as query strategy improves the Active Learning process, deriving in steeper learning curves and reaching top classifier performance in fewer iterations

    Data Analysis Methods for Software Systems

    Get PDF
    Using statistics, econometrics, machine learning, and functional data analysis methods, we evaluate the consequences of the lockdown during the COVID-19 pandemics for wage inequality and unemployment. We deduce that these two indicators mostly reacted to the first lockdown from March till June 2020. Also, analysing wage inequality, we conduct analysis separately for males and females and different age groups.We noticed that young females were affected mostly by the lockdown.Nevertheless, all the groups reacted to the lockdown at some level

    Deep Learning for Mobile Mental Health: Challenges and recent advances

    Get PDF
    Mental health plays a key role in everyone’s day-to-day lives, impacting our thoughts, behaviours, and emotions. Also, over the past years, given its ubiquitous and affordable characteristics, the use of smartphones and wearable devices has grown rapidly and provided support within all aspects of mental health research and care, spanning from screening and diagnosis to treatment and monitoring, and attained significant progress to improve remote mental health interventions. While there are still many challenges to be tackled in this emerging cross-discipline research field, such as data scarcity, lack of personalisation, and privacy concerns, it is of primary importance that innovative signal processing and deep learning techniques are exploited. Particularly, recent advances in deep learning can help provide the key enabling technology for the development of the next-generation user-centric mobile mental health applications. In this article, we first brief basic principles associated with mobile device-based mental health analysis, review the main system components, and highlight conventional technologies involved. Next, we describe several major challenges and various deep learning technologies that have potentials for a strong contribution in dealing with these challenges, respectively. Finally, we discuss other remaining problems which need to be addressed via research collaboration across multiple disciplines.This paper has been partially funded by the Bavarian Ministry of Science and Arts as part of the Bavarian Research Association ForDigitHealth, the National Natural Science Foundation of China (Grant No. 62071330, 61702370), and the Key Program of the National Natural Science Foundation of China (Grant No: 61831022)

    Spoken Corpora Good Practice Guide 2006

    Get PDF
    International audienceThere is currently a vast amount of fundamental or applied research, which is based on the exploitation of oral corpora (organized recorded collections of oral and multimodal language productions). Created as a result of linguists becoming aware of the importance to ensure the durability of sources and a diversified access to the oral documents they produce, this Guide to good practice mainly deals with “oral corpora”, created for and used by linguists. But the questions raised by the creation and documentary exploitation of these corpora can be found in numerous disciplines: ethnology, anthropology, sociology, psychology, demography, oral history notably use oral surveys, testimonies, interviews, life stories. Based on a linguistic approach, this Guide also touches on the preoccupations of other researchers who use oral corpora (for example in the field of speech synthesis and recognition), even if their specific needs aren’t consistently dealt with in the present document

    On the Use of YouTube, Digital Games, Argument Maps, and Digital Feedback in Teaching Philosophy

    Get PDF
    We give an overview of the methodological possibilities of some important digital tools for teaching philosophy. Several didactically applicable methods have evolved in digital culture, including their implicit methodologies, theories about how these methods may be used. These methodologies are already applied by philosophers today and have their benefits and justifications in philosophy classes as well. They can help to solve known problems of philosophy education. We discuss problems of incomprehensibility and their possible solutions through digital explanations in pod- and videocasts such as YouTube; problems of interaction, motivation, and immersion that digital games and gamification may solve; problems of the complexity of philosophical content and digital concept- and argument-maps to deal with these; problems of implicitness and the possibility to make implicit things in philosophy class explicit through indirect feedback tools

    Big Data and Artificial Intelligence in Digital Finance

    Get PDF
    This open access book presents how cutting-edge digital technologies like Big Data, Machine Learning, Artificial Intelligence (AI), and Blockchain are set to disrupt the financial sector. The book illustrates how recent advances in these technologies facilitate banks, FinTech, and financial institutions to collect, process, analyze, and fully leverage the very large amounts of data that are nowadays produced and exchanged in the sector. To this end, the book also describes some more the most popular Big Data, AI and Blockchain applications in the sector, including novel applications in the areas of Know Your Customer (KYC), Personalized Wealth Management and Asset Management, Portfolio Risk Assessment, as well as variety of novel Usage-based Insurance applications based on Internet-of-Things data. Most of the presented applications have been developed, deployed and validated in real-life digital finance settings in the context of the European Commission funded INFINITECH project, which is a flagship innovation initiative for Big Data and AI in digital finance. This book is ideal for researchers and practitioners in Big Data, AI, banking and digital finance
    • …
    corecore