45 research outputs found
A small Griko-Italian speech translation corpus
This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 2 hours of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Applying an automatic unit discovery method, pseudo-phones were also generated. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting some baseline results for the task of speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.
NLP for Language Varieties of Italy: Challenges and the Path Forward
Italy is characterized by a one-of-a-kind linguistic diversity landscape in
Europe, which implicitly encodes local knowledge, cultural traditions, artistic
expression, and history of its speakers. However, over 30 language varieties in
Italy are at risk of disappearing within a few generations. Language technology
has a major role in preserving endangered languages, but it currently struggles
with such varieties as they are under-resourced and mostly lack standardized
orthography, being mainly used in spoken settings. In this paper, we introduce
the linguistic context of Italy and discuss challenges facing the development
of NLP technologies for Italy's language varieties. We provide potential
directions and advocate for a shift in the paradigm from machine-centric to
speaker-centric NLP. Finally, we propose building a local community towards
responsible, participatory development of speech and language technologies for
languages and dialects of Italy.
Comment: 16 pages, 3 figures, 4 tables
OCR Post Correction for Endangered Language Texts
There is little to no data available to build natural language processing
models for most endangered languages. However, textual data in these languages
often exists in formats that are not machine-readable, such as paper books and
scanned images. In this work, we address the task of extracting text from these
resources. We create a benchmark dataset of transcriptions for scanned books in
three critically endangered languages and present a systematic analysis of how
general-purpose OCR tools are not robust to the data-scarce setting of
endangered languages. We develop an OCR post-correction method tailored to ease
training in this data-scarce setting, reducing the recognition error rate by
34% on average across the three languages.
Comment: Accepted to EMNLP 202
Contemporary research in minoritized and diaspora languages of Europe
Synopsis:
This volume provides a collection of research reports on multilingualism and language contact, ranging from Romance to Germanic, Greco and Slavic languages in situations of contact and diaspora. Most of the contributions are empirically oriented studies presenting first-hand data based on original fieldwork, and a few focus directly on the methodological issues in such research. Owing to the multifaceted nature of contact and diaspora phenomena (e.g. the intrinsic transnational essence of contact and diaspora, the associated interplay between majority and minoritized languages and multilingual practices in different contact settings, contact-induced language change, and issues relating to convergence), the disciplinary scope is broad and includes ethnography, qualitative and quantitative sociolinguistics, formal linguistics, descriptive linguistics, contact linguistics, historical linguistics, and language acquisition. Case studies are drawn from Italo-Romance varieties in the Americas, Spanish-Nahuatl contact, Castellano Andino, Greko/Griko in Southern Italy, Yiddish in Anglophone communities, Frisian in the Netherlands, Wymysiöryś in Poland, Sorbian in Germany, and Pomeranian and Zeelandic Flemish in Brazil.