
    bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)

    This article describes the system that participated in the part-of-speech tagging subtask of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media. The system combines a small assortment of trending techniques that implement matured methods from NLP and ML to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, it uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.
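    The architecture the abstract outlines can be sketched in a few lines. This is a minimal illustration, not the authors' code: all layer sizes, the affix length and the input names are assumptions. It combines a word-embedding lookup with embeddings of the first and last characters of each word before a bidirectional LSTM that emits one PoS tag per token.

    # Minimal Keras sketch of a tagger combining word embeddings with
    # character-level representations of word beginnings and endings.
    # All dimensions below are illustrative, not the paper's values.
    import tensorflow as tf
    from tensorflow.keras import layers

    MAX_LEN, VOCAB, CHARS, N_TAGS, AFFIX = 50, 20000, 100, 54, 5

    words  = layers.Input(shape=(MAX_LEN,), dtype="int32", name="word_ids")
    prefix = layers.Input(shape=(MAX_LEN, AFFIX), dtype="int32", name="first_chars")
    suffix = layers.Input(shape=(MAX_LEN, AFFIX), dtype="int32", name="last_chars")

    w_emb = layers.Embedding(VOCAB, 100)(words)      # pre-trained in the paper
    c_emb = layers.Embedding(CHARS, 16)              # shared character embedding
    p_emb = layers.Reshape((MAX_LEN, AFFIX * 16))(c_emb(prefix))
    s_emb = layers.Reshape((MAX_LEN, AFFIX * 16))(c_emb(suffix))

    x = layers.Concatenate()([w_emb, p_emb, s_emb])  # word + affix features
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

    model = tf.keras.Model([words, prefix, suffix], tags)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")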

    How FAIR are CMC Corpora?

    In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research be managed and preserved in a way that makes research results reproducible. To account for this, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. This article investigates 24 European CMC corpora with regard to their compliance with the FAIR principles and discusses to what extent depositing research data in repositories of data preservation initiatives such as CLARIN, Zenodo or Metashare can assist in the provision of FAIR corpora.

    Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian

    Part-of-speech (PoS) tagging is a common task in Natural Language Processing (NLP), given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS tagging have changed. The majority of existing Italian data for this task originates from standard texts, whose language use is far from multifaceted informal real-life situations. Automatic PoS-tagging models trained on such data do not perform reliably on non-standard language, such as social media content or language learners' texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard Italian. With version 3.7 of Stanza, a Python NLP package, we apply the available automatic PoS-taggers, namely ISDT, ParTUT, POSTWITA, TWITTIRÒ and VIT, trained on both standard and non-standard data, to our dataset. Our results show that the above taggers, trained on non-standard data or multilingual treebanks, can achieve up to 95% accuracy on multilingual learner data if combined.
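    Running the treebank-specific Italian models mentioned above takes only a few lines with Stanza. A hedged sketch: the package argument selects the treebank a model was trained on, the example sentence is invented, and the exact package names and download API may vary between Stanza versions.

    # Tag the same sentence with Stanza models trained on different
    # Italian treebanks; package names follow Stanza's treebank naming.
    import stanza

    for package in ["isdt", "partut", "postwita", "twittiro", "vit"]:
        stanza.download("it", package=package, verbose=False)
        nlp = stanza.Pipeline("it", processors="tokenize,pos",
                              package=package, verbose=False)
        doc = nlp("ieri sono andato al mare con i miei amici")
        print(package, [(w.text, w.upos)
                        for s in doc.sentences for w in s.words])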

    The PAISÀ Corpus of Italian Web Texts

    PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.

    bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

    This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of EVALITA 2016, the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small assortment of trending techniques that implement matured methods from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.
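    The unlabelled corpora mentioned here (C4Corpus, PAISÀ) are typically turned into word embeddings before tagger training. A minimal, hedged sketch of that step with gensim's word2vec; the file name and hyperparameters are illustrative assumptions, not the authors' settings.

    # Train word2vec embeddings on an unlabelled, pre-tokenised corpus
    # (one sentence per line) and save them for use in a tagger.
    from gensim.models import Word2Vec

    with open("paisa.txt", encoding="utf-8") as f:   # hypothetical dump
        sentences = [line.split() for line in f]

    model = Word2Vec(sentences, vector_size=100, window=5,
                     min_count=5, workers=4)
    model.wv.save_word2vec_format("embeddings.txt")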

    Ludwig & Gertrude

    This book celebrates Gertrude Stein's delight in repetition and relates it to Ludwig Wittgenstein's hypothesis that the meaning of a word lies in its use. It contains their shared vocabulary, the common intersections in which two authorships meet, quite literally. This Shared Vocabulary visualises the common lexicon of two major works by the Austrian-British philosopher and the American writer, namely Tractatus Logico-Philosophicus (1922) and Tender Buttons (1914). The two texts were compared and overlaid with the help of a computer program from the field of comparative corpus linguistics. All words that occur in both works are set in bold; all others, which occur in only one of the two works, are set in regular type, without marking in which one. Ludwig & Gertrude was developed as part of the transdisciplinary PhD project Ludwig Wittgenstein & Gertrude Stein – Meeting in Language. More on the genesis, the procedure and the motives for this Shared Vocabulary can be found in Taxidermy for Language-Animals (Rollo Press, Zürich 2016 / 2020). An earlier collaboration between Egon Stemle and Tine Melzer is The Complete Dictionary (2003).
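    The comparison described above boils down to intersecting the vocabularies of two texts. A minimal sketch, assuming plain-text English editions of both works and simple alphabetic tokenisation; the file names are hypothetical.

    # Compute the shared vocabulary of two texts: words in the
    # intersection would be set in bold, all others in regular type.
    import re

    def vocabulary(path):
        with open(path, encoding="utf-8") as f:
            return {w.lower() for w in re.findall(r"[a-zA-Z]+", f.read())}

    tractatus = vocabulary("tractatus.txt")
    tender_buttons = vocabulary("tender_buttons.txt")

    shared = tractatus & tender_buttons
    print(len(shared), sorted(shared)[:20])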

    Harvesting Relations from the Web - Quantifying the Impact of Filtering Functions

    Blohm S, Cimiano P, Stemle E. Harvesting Relations from the Web - Quantifying the Impact of Filtering Functions. In: Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07). Association for the Advancement of Artificial Intelligence (AAAI); 2007: 1316-1323.

    Paddy WaC: a minimally-supervised web-corpus of Hiberno-English

    Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text and distinguishing it from related languages in the same region can be difficult. For example, less dominant variants of English (e.g. New Zealand, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms and of several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.
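    Two of the filters described above, the national-domain filter and the seed-term bootstrap, are easy to illustrate. A minimal sketch under stated assumptions: the seed list, the ".ie" heuristic and the two-hit threshold are illustrative, not the paper's actual configuration, and the orthographic classifier is omitted.

    # Keep a page if it sits under the Irish national domain or if it
    # contains enough Ireland-specific seed terms (slang, institutions,
    # place names). URLs must include a scheme for urlparse to work.
    from urllib.parse import urlparse

    SEEDS = {"garda", "taoiseach", "craic", "dáil", "gaeltacht"}

    def is_national_domain(url):
        host = urlparse(url).hostname or ""
        return host.endswith(".ie")

    def has_seed_terms(text, min_hits=2):
        tokens = set(text.lower().split())
        return len(SEEDS & tokens) >= min_hits

    def keep(url, text):
        return is_national_domain(url) or has_seed_terms(text)

    print(keep("http://example.ie/news", "local news"))    # True (domain)
    print(keep("http://example.com/blog",
               "the craic at the dáil was mighty"))        # True (seeds)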