77 research outputs found
Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian
Part-of-speech (PoS) tagging constitutes a common task in Natural Language Processing (NLP), given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods for PoS-tagging have changed. The majority of existing Italian data for this task originate from standard texts, where language use is far from multifaceted informal real-life situations. Automatic PoS-tagging models trained with such data do not perform reliably on non-standard language, like social media content or language learners' texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard Italian. With the 3.7 version of Stanza, a Python NLP package, we apply available automatic PoS-taggers, namely ISDT, ParTUT, POSTWITA, TWITTIRÒ and VIT, trained with both standard and non-standard data, to our dataset. Our results show that the above taggers, trained on non-standard data or multilingual Treebanks, can achieve up to 95% accuracy on multilingual learner data, if combined
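The abstract above does not say how the taggers were combined or how accuracy was scored. As a minimal sketch only, the following assumes token-level accuracy against a gold standard and a simple majority vote across taggers; the function names and the toy UPOS sequences are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy(predicted, gold):
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def majority_vote(tagger_outputs):
    """Combine per-token tags from several taggers (assumed strategy).

    Ties are resolved in favour of the tagger listed first.
    """
    combined = []
    for token_tags in zip(*tagger_outputs):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag, in tagger order, that reaches the top count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


# Hypothetical toy data: one five-token sentence, three taggers.
gold = ["PRON", "VERB", "DET", "NOUN", "ADJ"]
tagger_a = ["PRON", "VERB", "DET", "NOUN", "NOUN"]
tagger_b = ["PRON", "AUX", "DET", "NOUN", "ADJ"]
tagger_c = ["DET", "VERB", "DET", "PROPN", "ADJ"]

combined = majority_vote([tagger_a, tagger_b, tagger_c])
for name, tags in [("a", tagger_a), ("b", tagger_b), ("c", tagger_c), ("combined", combined)]:
    print(name, accuracy(tags, gold))
```

In this toy case each tagger errs on a different token, so the vote recovers the gold sequence. Real tagger errors are often correlated, in which case voting helps less.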
How FAIR are CMC Corpora?
In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research be managed and preserved in such a way that research results are reproducible. To account for this, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. This article investigates 24 European CMC corpora with regard to their compliance with the FAIR principles and discusses to what extent the deposit of research data in repositories of data preservation initiatives such as CLARIN, Zenodo or Metashare can assist in the provision of FAIR corpora
Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics
A growing number of studies report interesting insights gained from existing data resources. Among those, there are analyses on textual data, giving reason to consider such methods for linguistics as well. However, the field of corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions.
This thesis aims to shed some light on the potentials of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating the possibility to repurpose existing German language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods.
After introducing strategies and methods that have already been used on language data, discussing how they can assist corpus linguistic analysis, and referring to available toolkits and software as well as to state-of-the-art research and further references, the thesis applies the introduced methodological toolset in two differently shaped corpus studies that utilize readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings on student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions that are based on previous research in the field. While both studies give linguistic insights that integrate into the current understanding of the investigated phenomena in the German language, they systematically test the methodological toolset introduced beforehand, allowing a detailed discussion of added values and remaining challenges of machine-learning-based data mining methods in corpus linguistics at the end of the thesis
DIALEKTALE SPRACHREALITÄTEN ÜBER CMC-KORPORA ERLEBEN. Das DiDi-Korpus zur internetbasierten Kommunikation aus Südtirol im DaZ-Unterricht
This article presents the DiDi corpus of internet-based communication (IBK) from South Tyrol (cf. Frey et al. 2016) and discusses its potential value for teaching German as a second language (DaZ). The corpus consists of about 40,000 Facebook texts by private users and shows an authentic sample of written digital communication in South Tyrol, which, in addition to typical features of IBK, is characterized by the use of several varieties (e.g. Standard German and dialect) and languages (e.g. German and Italian) (cf. Glaznieks / Frey 2018). The article focuses on the written use of several varieties in IBK and highlights the sociocultural and communicative-pragmatic added value of the DiDi corpus as an authentic language resource for DaZ learners
Verschriftungsprinzipien im geschriebenen Dialekt: WhatsApp-Nachrichten aus Südtirol
Although the orthographic norm of the standard language has dominated most written registers of German, social media and other non-standardized digital contexts have recently given rise to the written use of dialects and vernaculars in informal communication. The written use of non-standard dialects is especially widespread in the south of the German-speaking area, e.g. in Switzerland, Bavaria, Austria and South Tyrol, where various studies have reported the use of dialectal forms in digital contexts such as chatrooms, WhatsApp or Facebook (cf. Christen 2004; Glaznieks/Frey 2018). Using a corpus of South Tyrolean WhatsApp chats with corresponding audio recordings of the chat authors retelling the chat contents, we analyze four phonetic-phonological phenomena of Tyrolean dialects, characteristic of the southern German-speaking area: pre-consonantal s-retraction, the neutralization of the phonemes /p/ and /b/ in word-initial position, vowels undergoing umlaut or unrounding and the realisation of r in the coda of unstressed syllables. In particular, we analyze if and how these phenomena of the dialect are represented in the written form. The results show that Standard German graphematic conventions form the basis for most dialect spellings in WhatsApp chats. However, they are sometimes abandoned in favour of spellings that explicitly represent dialectal pronunciations. Interestingly, in some cases these dialectal spellings do not correspond to the pronunciation of the writers who, instead, opt for a pronunciation closer to that of the standard language
Antimicrobial resistance among migrants in Europe: a systematic review and meta-analysis
BACKGROUND: Rates of antimicrobial resistance (AMR) are rising globally and there is concern that increased migration is contributing to the burden of antibiotic resistance in Europe. However, the effect of migration on the burden of AMR in Europe has not yet been comprehensively examined. Therefore, we did a systematic review and meta-analysis to identify and synthesise data for AMR carriage or infection in migrants to Europe to examine differences in patterns of AMR across migrant groups and in different settings. METHODS: For this systematic review and meta-analysis, we searched MEDLINE, Embase, PubMed, and Scopus with no language restrictions from Jan 1, 2000, to Jan 18, 2017, for primary data from observational studies reporting antibacterial resistance in common bacterial pathogens among migrants to 21 European Union-15 and European Economic Area countries. To be eligible for inclusion, studies had to report data on carriage or infection with laboratory-confirmed antibiotic-resistant organisms in migrant populations. We extracted data from eligible studies and assessed quality using piloted, standardised forms. We did not examine drug resistance in tuberculosis and excluded articles solely reporting on this parameter. We also excluded articles in which migrant status was determined by ethnicity, country of birth of participants' parents, or was not defined, and articles in which data were not disaggregated by migrant status. Outcomes were carriage of or infection with antibiotic-resistant organisms. We used random-effects models to calculate the pooled prevalence of each outcome. The study protocol is registered with PROSPERO, number CRD42016043681. FINDINGS: We identified 2274 articles, of which 23 observational studies reporting on antibiotic resistance in 2319 migrants were included. 
The pooled prevalence of any AMR carriage or AMR infection in migrants was 25·4% (95% CI 19·1-31·8; I²=98%), including meticillin-resistant Staphylococcus aureus (7·8%, 4·8-10·7; I²=92%) and antibiotic-resistant Gram-negative bacteria (27·2%, 17·6-36·8; I²=94%). The pooled prevalence of any AMR carriage or infection was higher in refugees and asylum seekers (33·0%, 18·3-47·6; I²=98%) than in other migrant groups (6·6%, 1·8-11·3; I²=92%). The pooled prevalence of antibiotic-resistant organisms was slightly higher in high-migrant community settings (33·1%, 11·1-55·1; I²=96%) than in migrants in hospitals (24·3%, 16·1-32·6; I²=98%). We did not find evidence of high rates of transmission of AMR from migrant to host populations. INTERPRETATION: Migrants are exposed to conditions favouring the emergence of drug resistance during transit and in host countries in Europe. Increased antibiotic resistance among refugees and asylum seekers and in high-migrant community settings (such as refugee camps and detention facilities) highlights the need for improved living conditions, access to health care, and initiatives to facilitate detection of and appropriate high-quality treatment for antibiotic-resistant infections during transit and in host countries. Protocols for the prevention and control of infection and for antibiotic surveillance need to be integrated in all aspects of health care, which should be accessible for all migrant groups, and should target determinants of AMR before, during, and after migration. FUNDING: UK National Institute for Health Research Imperial Biomedical Research Centre, Imperial College Healthcare Charity, the Wellcome Trust, and UK National Institute for Health Research Health Protection Research Unit in Healthcare-associated Infections and Antimicrobial Resistance at Imperial College London
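The abstract above reports random-effects pooled prevalences with 95% confidence intervals and I² values, but does not name the estimator. As a rough sketch only, the following assumes the DerSimonian-Laird method applied to untransformed proportions; the function name and the study counts are hypothetical and do not reproduce the review's data or results.

```python
import math


def pooled_prevalence_random_effects(events, totals):
    """DerSimonian-Laird pooled proportion (assumed estimator).

    Returns (pooled, (ci_low, ci_high), i_squared_percent).
    """
    if len(events) != len(totals) or len(events) < 2:
        raise ValueError("need matching counts for at least two studies")
    p = [e / n for e, n in zip(events, totals)]
    if any(pi <= 0.0 or pi >= 1.0 for pi in p):
        raise ValueError("proportions of 0 or 1 need a transformation or correction")

    # Within-study variances and inverse-variance (fixed-effect) weights.
    v = [pi * (1 - pi) / n for pi, n in zip(p, totals)]
    w = [1 / vi for vi in v]
    fixed = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)

    # Cochran's Q and the between-study variance tau^2.
    q = sum(wi * (pi - fixed) ** 2 for wi, pi in zip(w, p))
    df = len(p) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2


# Hypothetical counts: resistant isolates / migrants screened in three studies.
pooled, (ci_low, ci_high), i2 = pooled_prevalence_random_effects([5, 40, 30], [100, 100, 200])
print(f"pooled={pooled:.3f}  95% CI {ci_low:.3f}-{ci_high:.3f}  I2={i2:.0f}%")
```

Published meta-analyses of prevalence often transform the proportions first (for example with a logit or double-arcsine transformation); the untransformed version is used here only to keep the sketch short.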