
    bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)

    This article describes the system that participated in the part-of-speech tagging subtask of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media. The system combines a small assortment of trending techniques that implement matured methods from NLP and ML to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, it uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.
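    The architecture the abstract outlines can be sketched in a few lines. This is a minimal illustration, not the authors' code: all layer sizes, the affix length and the input names are assumptions. It combines a word-embedding lookup with embeddings of the first and last characters of each word before a bidirectional LSTM that emits one PoS tag per token.

    # Minimal Keras sketch of a tagger combining word embeddings with
    # character-level representations of word beginnings and endings.
    # All dimensions below are illustrative, not the paper's values.
    import tensorflow as tf
    from tensorflow.keras import layers

    MAX_LEN, VOCAB, CHARS, N_TAGS, AFFIX = 50, 20000, 100, 54, 5

    words  = layers.Input(shape=(MAX_LEN,), dtype="int32", name="word_ids")
    prefix = layers.Input(shape=(MAX_LEN, AFFIX), dtype="int32", name="first_chars")
    suffix = layers.Input(shape=(MAX_LEN, AFFIX), dtype="int32", name="last_chars")

    w_emb = layers.Embedding(VOCAB, 100)(words)      # pre-trained in the paper
    c_emb = layers.Embedding(CHARS, 16)              # shared character embedding
    p_emb = layers.Reshape((MAX_LEN, AFFIX * 16))(c_emb(prefix))
    s_emb = layers.Reshape((MAX_LEN, AFFIX * 16))(c_emb(suffix))

    x = layers.Concatenate()([w_emb, p_emb, s_emb])  # word + affix features
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

    model = tf.keras.Model([words, prefix, suffix], tags)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")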

    How FAIR are CMC Corpora?

    In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research be managed and preserved in a way that makes research results reproducible. To account for this, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. This article investigates 24 European CMC corpora with regard to their compliance with the FAIR principles and discusses to what extent depositing research data in repositories of data preservation initiatives such as CLARIN, Zenodo or Metashare can assist in the provision of FAIR corpora.

    Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian

    Part-of-speech (PoS) tagging is a common task in Natural Language Processing (NLP), given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS tagging have changed. The majority of existing Italian data for this task originates from standard texts, whose language use is far from multifaceted informal real-life situations. Automatic PoS-tagging models trained on such data do not perform reliably on non-standard language, such as social media content or language learners' texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard Italian. With version 3.7 of Stanza, a Python NLP package, we apply the available automatic PoS-taggers, namely ISDT, ParTUT, POSTWITA, TWITTIRÒ and VIT, trained on both standard and non-standard data, to our dataset. Our results show that the above taggers, trained on non-standard data or multilingual treebanks, can achieve up to 95% accuracy on multilingual learner data if combined.
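    Running the treebank-specific Italian models mentioned above takes only a few lines with Stanza. A hedged sketch: the package argument selects the treebank a model was trained on, the example sentence is invented, and the exact package names and download API may vary between Stanza versions.

    # Tag the same sentence with Stanza models trained on different
    # Italian treebanks; package names follow Stanza's treebank naming.
    import stanza

    for package in ["isdt", "partut", "postwita", "twittiro", "vit"]:
        stanza.download("it", package=package, verbose=False)
        nlp = stanza.Pipeline("it", processors="tokenize,pos",
                              package=package, verbose=False)
        doc = nlp("ieri sono andato al mare con i miei amici")
        print(package, [(w.text, w.upos)
                        for s in doc.sentences for w in s.words])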

    The PAISÀ Corpus of Italian Web Texts

    PAISÀ is a Creative Commons licensed, large web corpus of contemporary Italian. We describe the design, harvesting, and processing steps involved in its creation.

    bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

    This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of EVALITA 2016, the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. The work is a continuation of Stemle (2016) with minor modifications to the system and different data sets. It combines a small assortment of trending techniques that implement matured methods from NLP and ML to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Italian UD corpus, DiDi and PoSTWITA) and unlabelled data (Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.
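    The unlabelled corpora mentioned here (C4Corpus, PAISÀ) are typically turned into word embeddings before tagger training. A minimal, hedged sketch of that step with gensim's word2vec; the file name and hyperparameters are illustrative assumptions, not the authors' settings.

    # Train word2vec embeddings on an unlabelled, pre-tokenised corpus
    # (one sentence per line) and save them for use in a tagger.
    from gensim.models import Word2Vec

    with open("paisa.txt", encoding="utf-8") as f:   # hypothetical dump
        sentences = [line.split() for line in f]

    model = Word2Vec(sentences, vector_size=100, window=5,
                     min_count=5, workers=4)
    model.wv.save_word2vec_format("embeddings.txt")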

    Ludwig & Gertrude

    This book celebrates Gertrude Stein's delight in repetition and relates it to Ludwig Wittgenstein's hypothesis that the meaning of a word lies in its use. It contains their shared vocabulary, the common intersections in which two authorships meet, quite literally. This Shared Vocabulary visualises the common lexicon of two major works by the Austrian-British philosopher and the American writer, namely Tractatus Logico-Philosophicus (1922) and Tender Buttons (1914). The two texts were compared and overlaid with the help of a computer program from the field of comparative corpus linguistics. All words that occur in both works are set in bold; all others, which occur in only one of the two works, are set in regular type, without marking in which one. Ludwig & Gertrude was developed as part of the transdisciplinary PhD project Ludwig Wittgenstein & Gertrude Stein – Meeting in Language. More on the genesis, the procedure and the motives for this Shared Vocabulary can be found in Taxidermy for Language-Animals (Rollo Press, Zürich 2016 / 2020). An earlier collaboration between Egon Stemle and Tine Melzer is The Complete Dictionary (2003).
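    The comparison described above boils down to intersecting the vocabularies of two texts. A minimal sketch, assuming plain-text English editions of both works and simple alphabetic tokenisation; the file names are hypothetical.

    # Compute the shared vocabulary of two texts: words in the
    # intersection would be set in bold, all others in regular type.
    import re

    def vocabulary(path):
        with open(path, encoding="utf-8") as f:
            return {w.lower() for w in re.findall(r"[a-zA-Z]+", f.read())}

    tractatus = vocabulary("tractatus.txt")
    tender_buttons = vocabulary("tender_buttons.txt")

    shared = tractatus & tender_buttons
    print(len(shared), sorted(shared)[:20])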

    Harvesting Relations from the Web - Quantifying the Impact of Filtering Functions

    Blohm S, Cimiano P, Stemle E. Harvesting Relations from the Web - Quantifying the Impact of Filtering Functions. In: Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07). Association for the Advancement of Artificial Intelligence (AAAI); 2007: 1316-1323.

    Paddy WaC: a minimally-supervised web-corpus of Hiberno-English

    Small, manually assembled corpora may be available for less dominant languages and dialects, but producing web-scale resources remains a challenge. Even when considerable quantities of text are present on the web, finding this text and distinguishing it from related languages in the same region can be difficult. For example, less dominant variants of English (e.g. New Zealand, Singaporean, Canadian, Irish, South African) may be found under their respective national domains, but will be partially mixed with Englishes of the British and US varieties, perhaps through syndication of journalism or the local reuse of text by multinational companies. Less formal dialectal usage may be scattered more widely over the internet through mechanisms such as wiki or blog authoring. Here we automatically construct a corpus of Hiberno-English (English as spoken in Ireland) using a variety of methods: filtering by national domain, filtering by orthographic conventions, and bootstrapping from a set of Ireland-specific terms (slang, place names, organisations). We evaluate the national specificity of the resulting corpora by measuring the incidence of topical terms and of several grammatical constructions that are particular to Hiberno-English. The results show that domain filtering is very effective for isolating text that is topic-specific, and orthographic classification can exclude some non-Irish texts, but that selected seeds are necessary to extract considerable quantities of more informal, dialectal text.
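    Two of the filters described above, the national-domain filter and the seed-term bootstrap, are easy to illustrate. A minimal sketch under stated assumptions: the seed list, the ".ie" heuristic and the two-hit threshold are illustrative, not the paper's actual configuration, and the orthographic classifier is omitted.

    # Keep a page if it sits under the Irish national domain or if it
    # contains enough Ireland-specific seed terms (slang, institutions,
    # place names). URLs must include a scheme for urlparse to work.
    from urllib.parse import urlparse

    SEEDS = {"garda", "taoiseach", "craic", "dáil", "gaeltacht"}

    def is_national_domain(url):
        host = urlparse(url).hostname or ""
        return host.endswith(".ie")

    def has_seed_terms(text, min_hits=2):
        tokens = set(text.lower().split())
        return len(SEEDS & tokens) >= min_hits

    def keep(url, text):
        return is_national_domain(url) or has_seed_terms(text)

    print(keep("http://example.ie/news", "local news"))    # True (domain)
    print(keep("http://example.com/blog",
               "the craic at the dáil was mighty"))        # True (seeds)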