17 research outputs found

    bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

    This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of EVALITA 2016, the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. The work is a continuation of Stemle (2016), with minor modifications to the system and different data sets. It combines a small selection of trending techniques that implement mature methods from NLP and machine learning to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (the Italian UD corpus, DiDi, and PoSTWITA) and unlabelled data (the Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.
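    As a rough illustration of the architecture sketched above (word embeddings combined with character-level representations of word beginnings and endings, fed to an LSTM RNN), the following Keras sketch builds a comparable tagger. All vocabulary sizes, embedding dimensions, the affix length, and the layer choices are illustrative assumptions, not the parameters of the bot.zen system.

        # Sketch only: hyperparameters and layer choices are illustrative, not bot.zen's.
        from tensorflow.keras import layers, Model

        MAX_LEN = 50       # tokens per tweet (padded/truncated)
        AFFIX_LEN = 4      # characters kept from each word's beginning and ending
        WORD_VOCAB, CHAR_VOCAB, N_TAGS = 20000, 120, 18

        # Inputs: word ids plus character ids of each word's first and last AFFIX_LEN characters
        words    = layers.Input(shape=(MAX_LEN,), name="word_ids")
        prefixes = layers.Input(shape=(MAX_LEN, AFFIX_LEN), name="prefix_char_ids")
        suffixes = layers.Input(shape=(MAX_LEN, AFFIX_LEN), name="suffix_char_ids")

        w_emb = layers.Embedding(WORD_VOCAB, 100)(words)
        char_emb = layers.Embedding(CHAR_VOCAB, 16)
        # Collapse each word's beginning/ending characters into one fixed-size vector
        p_emb = layers.TimeDistributed(layers.Flatten())(char_emb(prefixes))
        s_emb = layers.TimeDistributed(layers.Flatten())(char_emb(suffixes))

        x = layers.Concatenate()([w_emb, p_emb, s_emb])
        x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(x)
        tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

        model = Model(inputs=[words, prefixes, suffixes], outputs=tags)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        model.summary()

    Keeping the first and last few characters of every token gives the model access to inflectional endings and capitalisation cues that word embeddings trained on unlabelled data may miss for rare or misspelled Twitter tokens.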

    Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17)

    This volume presents the proceedings of the 5th edition of the annual conference series on CMC and Social Media Corpora for the Humanities (cmc-corpora2017). This conference series is dedicated to the collection, annotation, processing, and exploitation of corpora of computer-mediated communication (CMC) and social media for research in the humanities. The annual event brings together language-centered research on CMC and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. The 5th Conference on CMC and Social Media Corpora for the Humanities was held at Eurac Research in Bolzano, Italy, on October 4th and 5th. This volume contains extended abstracts of the invited talks, the papers, and extended abstracts of the posters presented at the event. The conference attracted 26 valid submissions. Each submission was reviewed by at least two members of the scientific committee. The committee decided to accept 16 papers and 8 posters, of which 14 papers and 3 posters were presented at the conference. The programme also included three invited talks: two keynote talks by Aivars Glaznieks (Eurac Research, Italy) and A. Seza Doğruöz (independent researcher), and an invited talk on the Common Language Resources and Technology Infrastructure (CLARIN) given by Darja Fišer, the CLARIN ERIC Director of User Involvement.

    Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora

    Up until today, research in various educational and linguistic domains, such as learner corpus research, writing research, or second language acquisition, has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions, combined with domain-inherent obstacles to data sharing, has so far hampered the comparability, reusability, and reproducibility of data and research results. In this article, we present work on creating a digital infrastructure for L1 and L2 learner corpora and on populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing on technical solutions and frameworks from research data management, among them the FAIR guiding principles for data stewardship. We share our experiences from integrating some L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with the FAIR principles and the standards we established for reproducibility, and we discuss to what extent research data collected in the past can be made comparable, reusable, and reproducible. Our results show that some basic needs for providing comparable and reusable data are covered by existing general infrastructure solutions and can be exploited for domain-specific infrastructures such as the one presented in this article; other aspects need genuinely domain-driven approaches. The solutions found for the corpora in the presented infrastructure can only be a preliminary attempt, and further community involvement would be needed to provide templates and models acknowledged and promoted by the community. Furthermore, forward-looking data management would be needed from the beginning of new corpus-creation projects to ensure that all requirements for FAIR data can be met.
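    The FAIR-related part of the work lends itself to a small illustration: a minimal, machine-readable metadata record of the kind that makes a learner corpus findable, accessible, interoperable, and reusable. The field names and values below are invented for illustration and do not reflect the schema actually used by the infrastructure.

        # Sketch only: field names and values are invented, not the infrastructure's schema.
        import json

        corpus_record = {
            "identifier": "hdl:20.500.12345/example-corpus",   # hypothetical persistent identifier
            "title": "Example L2 learner corpus",
            "resource_type": "corpus",
            "languages": {"target": ["ita"], "l1": ["deu", "eng"]},
            "license": "CC BY-NC-SA 4.0",
            "access": "restricted: academic use after signing a data agreement",
            "formats": ["TEI/XML", "CoNLL-U"],
            "provenance": {
                "collected": "2014-2016",
                "annotation": ["learner error tags (manual)", "PoS tags (automatic, manually checked)"],
            },
            "related_publications": ["doi:10.1234/placeholder"],   # placeholder reference
        }

        print(json.dumps(corpus_record, indent=2, ensure_ascii=False))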

    The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts

    The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In keeping with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data have been manually anonymised; the corpus provides manually corrected part-of-speech tags for the Italian-language texts and manually normalised data for the German texts. Moreover, it is annotated with user-provided socio-demographic data from a questionnaire (among others L1, gender, age, education, and internet communication habits) and with linguistic annotations regarding CMC phenomena, languages, and varieties. The anonymised corpus is freely available for research purposes.
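    The annotation layers described in the abstract can be pictured with a small data-structure sketch: per-user sociodemographic metadata, per-message language labels and CMC phenomena, and per-token annotations (PoS for Italian, normalisation for German). The field names below are assumptions for illustration, not the corpus's actual serialisation format.

        # Sketch only: field names are assumptions, not the corpus's actual format.
        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class Token:
            form: str
            pos: Optional[str] = None          # manually corrected PoS tag (Italian texts)
            normalised: Optional[str] = None   # manual normalisation (German texts)

        @dataclass
        class Message:
            text: str
            language: str                                            # e.g. "de", "it", "en"
            cmc_phenomena: List[str] = field(default_factory=list)   # e.g. emoticons, iteration
            tokens: List[Token] = field(default_factory=list)

        @dataclass
        class UserProfile:
            l1: str
            gender: str
            age_group: str
            education: str
            messages: List[Message] = field(default_factory=list)

        user = UserProfile(l1="de", gender="f", age_group="20-29", education="university")
        user.messages.append(Message(
            text="ciao a tutti :)",
            language="it",
            cmc_phenomena=["emoticon"],
            tokens=[Token("ciao", pos="INTJ"), Token("a", pos="ADP"), Token("tutti", pos="PRON")],
        ))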

    Collecting language data of non-public social media profiles

    In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After outlining the particular challenges of building corpora of private, non-public computer-mediated communication, we present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding metadata. Finally, we discuss the positive and negative implications of this method.
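    A minimal sketch of the acquisition step may help picture the approach: once a participant has authorised the web application via Facebook Login, their posts can be retrieved through the Graph API and stored together with the questionnaire metadata. The endpoint, fields, and API version below are assumptions modelled on the Graph API of that period, not the project's actual implementation.

        # Sketch only: endpoint, fields and API version are assumptions, not the project's code.
        import requests

        GRAPH_URL = "https://graph.facebook.com/v2.5/me/feed"   # hypothetical API version

        def fetch_posts(access_token, limit=100):
            """Download a consenting participant's own posts, following pagination."""
            params = {"access_token": access_token,
                      "fields": "message,created_time",
                      "limit": limit}
            url, posts = GRAPH_URL, []
            while url:
                resp = requests.get(url, params=params)
                resp.raise_for_status()
                data = resp.json()
                posts.extend(p for p in data.get("data", []) if "message" in p)
                url = data.get("paging", {}).get("next")   # next page, or None when done
                params = {}                                # the "next" URL already carries its parameters
            return posts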

    Structure-Preserving Pipelines for Digital Libraries

    Most existing HLT pipelines assume the input is pure text or, at most, HTML, and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital libraries and other types of applications, and we show that it is relatively straightforward to extend existing pipelines into ones in which the structure of a document is preserved.
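    The argument can be illustrated with a short sketch: instead of flattening a document to plain text before running NLP tools, the pipeline keeps the logical structure (here, HTML block elements) and attaches the linguistic output to each structural unit. The whitespace split stands in for a real tokeniser or tagger, and the use of BeautifulSoup is an assumption for the sketch, not the paper's implementation.

        # Sketch only: the whitespace split stands in for a real tokeniser/tagger.
        from bs4 import BeautifulSoup

        html = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second one.</p></body></html>"
        soup = BeautifulSoup(html, "html.parser")

        annotated = []
        for element in soup.find_all(["h1", "h2", "p", "li"]):   # structural units to preserve
            text = element.get_text(" ", strip=True)
            tokens = text.split()                                 # placeholder for tokenisation/tagging
            annotated.append({"unit": element.name, "text": text, "tokens": tokens})

        for unit in annotated:
            print(unit["unit"], "->", unit["tokens"])

    Because every annotation stays attached to the structural unit it came from, downstream applications can still render or index the document by heading, paragraph, or list item.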

    Comparison of automatic vs. manual language identification in multilingual social media texts

    Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, language identification is needed. This study compares the performance of human annotators with automatic methods of language identification on a multilingual (mainly German-Italian-English) social media corpus collected in South Tyrol, Italy. Our results indicate that humans and Natural Language Processing (NLP) systems follow their own strategies when deciding on the language of multilingual text messages. This results in low agreement when different annotators or NLP systems perform the same task. In general, annotators agree with each other more than NLP systems do. However, human agreement also varies depending on whether guidelines for the annotation task were established beforehand.
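    Agreement of the kind discussed here is commonly quantified with chance-corrected measures such as Cohen's kappa. The sketch below shows how agreement between one annotator and one language-identification system on message-level labels could be computed; the labels are invented examples, not data from the corpus described above.

        # Sketch only: invented labels, not data from the corpus described above.
        from sklearn.metrics import cohen_kappa_score

        # One language label per message: German (de), Italian (it), English (en), or mixed
        annotator_a = ["de", "it", "it", "mixed", "en", "de", "it"]
        lid_system  = ["de", "it", "en", "it",    "en", "de", "it"]

        kappa = cohen_kappa_score(annotator_a, lid_system)
        print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, 0 = chance-level agreement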