42 research outputs found

Design and Evaluation of Machine Translation Systems Oriented Toward Practical Applications

Author: 阿部香央莉
Publication date: 24/03/2023

Tohoku University, Doctor of Philosophy (Information Sciences), thesis

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

Digitising Swiss German : how to process and study a polycentric spoken language

Author: Glaser Elvira
Samardžić Tanja
Scherrer Yves
Publication date: 29/11/2019

Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this, automatic processing of Swiss German remains a considerable challenge because it is mostly a spoken variety and is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is the result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access to the corpus for linguistic, historical and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular. Peer reviewed

Crossref

ZORA

Helsingin yliopiston digitaalinen arkisto

Computational Sociolinguistics: A Survey

Author: de Jong Franciska
Doğruöz A. Seza
Nguyen Dong
Rosé Carolyn P.
Publication date: 01/01/2016

Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

arXiv.org e-Print Archive

Crossref

Ghent University Academic Bibliography

EUR Research Repository

University of Twente Research Information

Regionalized models for Spanish language variations based on Twitter

Author: Graff Mario
Miranda Sabino
Moctezuma Daniela
Ruiz Guillermo
Tellez Eric S.
Publication date: 22/04/2022

Spanish is one of the most widely spoken languages in the world, but it is not necessarily written and spoken in the same way in different countries. Understanding local language variations can help improve model performance on regional tasks, both by capturing local structures and by improving the interpretation of a message's content. For instance, consider a machine learning engineer who automates a language classification task for a particular region, or a social scientist trying to understand a regional event that echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, as well as examples of using the regional resources on message classification tasks.

arXiv.org e-Print Archive

Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes

Author: Jauhiainen Heidi
Jauhiainen Tommi
Lindén Krister
Publication venue: COLING
Publication date: 12/10/2022

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Methods for large-scale data analyses of regional language variation based on speech acoustics

Author: Kisler Thomas
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 08/02/2019

Digitale Hochschulschriften der LMU

Dialect areas and contact dialectology

Author: Hasse Anja
Jeszenszky Péter
Stoeckle Philipp
Publication venue: Language Science Press
Publication date: 01/01/2023

Spatial variation of language has been researched qualitatively and quantitatively for at least 150 years by different sub-disciplines of linguistics, each defining differently what dialects and dialect areas are. Linguists agree, however, that the concept of dialect is vague and the extent of a dialect is fuzzy. With contact being a crucial driver of linguistic change at sub-language levels, we attempt to sketch the perspective that contact dialectology and related sub-disciplines can offer on this fuzziness with regard to the spatial variation of dialects and dialect areas. We thus address contact processes and patterns characterizing individuals, groups, communities, areas and beyond, at temporal scales ranging from everyday contact, through generations, to time depths long enough for dialects to diverge and disappear.

ZORA

Dialect-robust Evaluation of Generated Text

Author: Clark Elizabeth
Dozat Timothy
Eisenstein Jacob
Garrette Dan
Gehrmann Sebastian
Sellam Thibault
Siddhant Aditya
Sun Jiao
Vu Tu
Publication date: 02/11/2022

Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there is currently no way to quantify how metrics respond to a change in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests one can use to assess metrics in light of the two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step to overcome this limitation, we propose a training schema, NANO, which introduces regional and language information to the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve their dialect robustness while simultaneously improving their performance on the standard metric benchmark.

arXiv.org e-Print Archive

Mining for Parsing Failures

Author: de Kok Daniël
van Noord Gerardus
Publication venue: College Publications
Publication date: 01/01/2017

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen