15 research outputs found

Digitising Swiss German : how to process and study a polycentric spoken language

Authors: Glaser Elvira; Samardžić Tanja; Scherrer Yves
Publication date: 29/11/2019

Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first present the modalities of access of the corpus for linguistic, historic and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular.

Peer reviewed

Crossref

ZORA

Helsingin yliopiston digitaalinen arkisto

ArchiMob : A multidialectal corpus of Swiss German spontaneous speech

Authors: Glaser Elvira; Samardžić Tanja; Scherrer Yves
Publication date: 01/01/2019

Alemannische Dialektologie – Forschungsstand und Perspektiven. Sonderheft

Peer reviewed

Directory of Open Access Journals

ZORA

Helsingin yliopiston digitaalinen arkisto

BOP Serials

SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German

Authors: Dogan-Schönberger Pelin; Hofmann Thomas; Mäder Julian
Publication date: 21/03/2021

Swiss German is a dialect continuum whose natively acquired dialects significantly differ from the formal variety of the language. These dialects are mostly used for verbal communication and do not have standard orthography. This has led to a lack of annotated datasets, rendering the use of many NLP methods infeasible. In this paper, we introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference. Our goal has been to create and to make available a basic dataset for employing data-driven NLP applications in Swiss German. We present our data collection procedure in detail and validate the quality of our corpus by conducting experiments with the recent neural models for speech synthesis.

arXiv.org e-Print Archive

Repository for Publications and Research Data

A Large-Scale Comparison of Historical Text Normalization Systems

Author: Bollmann Marcel
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2019

There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.

Comment: Accepted at NAACL 201

arXiv.org e-Print Archive

Crossref

Publikationer från Linköpings universitet

Copenhagen University Research Information System

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Natural language processing for similar languages, varieties, and dialects: A survey

Authors: Nakov Preslav; Scherrer Yves; Zampieri Marcos
Publication date: 20/11/2020

There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.

Non peer reviewed

Helsingin yliopiston digitaalinen arkisto

Corpus linguistics for low-density varieties. Minority languages and corpus-based morphological investigations

Authors: Bellante Marco; Cioffi Raffaele; Gaeta Livio
Publication date: 01/01/2022

Institutional Research Information System University of Turin

A Report on the Third VarDial Evaluation Campaign

Authors: Butnaru Andrei; Huang Chu-Ren; Ionescu Radu Tudor; Jauhiainen Tommi Sakari; Klyueva Natalia; Malmasi Shervin; Pan Tung-Le; Samardžić Tanja; Scherrer Yves; Silfverberg Miikka Pietari; Tyers Francis; Zampieri Marcos
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2019

Non peer reviewed

Helsingin yliopiston digitaalinen arkisto

Wolverhampton Intellectual Repository and E-theses

New Developments in Tagging Pre-modern Orthodox Slavic Texts

Authors: Mocken Susanne; Rabus Achim; Scherrer Yves
Publication date: 01/01/2018

Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto