45 research outputs found

A small Griko-Italian speech translation corpus

Author: Anastasopoulos A.; Besacier L.; Lekakou M.; Villavicencio A.; Zanon Boito M.
Publication venue: International Speech Communication Association
Publication date: 27/07/2018

This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) that have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Pseudo-phones were also generated by applying an automatic unit discovery method. We detail how the corpus was collected, cleaned, and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

White Rose Research Online

NLP for Language Varieties of Italy: Challenges and the Path Forward

Author: Ramponi Alan
Publication date: 20/09/2022

Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which implicitly encodes the local knowledge, cultural traditions, artistic expression, and history of its speakers. However, over 30 language varieties in Italy are at risk of disappearing within a few generations. Language technology plays a central role in preserving endangered languages, but it currently struggles with such varieties because they are under-resourced and mostly lack a standardized orthography, being mainly used in spoken settings. In this paper, we introduce the linguistic context of Italy and discuss the challenges facing the development of NLP technologies for Italy's language varieties. We provide potential directions and advocate for a shift in the paradigm from machine-centric to speaker-centric NLP. Finally, we propose building a local community towards responsible, participatory development of speech and language technologies for languages and dialects of Italy. Comment: 16 pages, 3 figures, 4 tables

arXiv.org e-Print Archive

OCR Post Correction for Endangered Language Texts

Author: Anastasopoulos Antonios; Neubig Graham; Rijhwani Shruti
Publication date: 01/01/2020

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis showing that general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages. Comment: Accepted to EMNLP 2020

arXiv.org e-Print Archive

Crossref

Proceedings of the 24th Scandinavian Conference of Linguistics

Author: Anttikoski Esa; Tirkkonen Jani-Matti
Publication venue: University of Eastern Finland

UEF Electronic Publications

Contemporary research in minoritized and diaspora languages of Europe

Publication date: 01/01/2023

Synopsis: This volume provides a collection of research reports on multilingualism and language contact, ranging from Romance to Germanic, Greek, and Slavic languages in situations of contact and diaspora. Most of the contributions are empirically oriented studies presenting first-hand data based on original fieldwork, and a few focus directly on the methodological issues in such research. Owing to the multifaceted nature of contact and diaspora phenomena (e.g. the intrinsically transnational essence of contact and diaspora, the associated interplay between majority and minoritized languages and multilingual practices in different contact settings, contact-induced language change, and issues relating to convergence), the disciplinary scope is broad and includes ethnography, qualitative and quantitative sociolinguistics, formal linguistics, descriptive linguistics, contact linguistics, historical linguistics, and language acquisition. Case studies are drawn from Italo-Romance varieties in the Americas, Spanish-Nahuatl contact, Castellano Andino, Greko/Griko in Southern Italy, Yiddish in Anglophone communities, Frisian in the Netherlands, Wymysiöryś in Poland, Sorbian in Germany, and Pomeranian and Zeelandic Flemish in Brazil.

Institutional Repository of the Freie Universität Berlin

Vowel variation in the Mišótika Cappadocian of Mandra (Larisa)

Author: Janse Mark; Papazachariou Dimitris; Vassalou Nikoleta
Publication venue: University of Patras
Publication date: 01/01/2021

Ghent University Academic Bibliography

Subject contact relatives in Asia Minor Greek

Author: Bagriacik Metin
Publication venue: University of Patras
Publication date: 01/01/2021

Ghent University Academic Bibliography