695 research outputs found
Automatic Transcription of Northern Prinmi Oral Art: Approaches and Challenges to Automatic Speech Recognition for Language Documentation
One significant issue facing language documentation efforts is the transcription bottleneck: each documented recording must be transcribed and annotated, and these tasks are extremely labor intensive (Ćavar et al., 2016). Researchers have sought to accelerate these tasks with partial automation via forced alignment, natural language processing, and automatic speech recognition (ASR) (Neubig et al., 2020). Neural network (especially transformer-based) approaches have enabled large advances in ASR over the last decade. Models like XLSR-53 promise improved performance on under-resourced languages by leveraging massive data sets from many different languages (Conneau et al., 2020). This project extends these efforts to a novel context, applying XLSR-53 to Northern Prinmi, a Tibeto-Burman Qiangic language spoken in Southwest China (Daudey & Pincuo, 2020).
Specifically, this thesis aims to answer two questions. First, is the XLSR-53 ASR model useful for first-pass transcription of oral art recordings from Northern Prinmi, an under-resourced tonal language? Second, does preprocessing target transcripts to combine grapheme clusters (multi-character representations of lexical tones and characters with modifying diacritics) into more phonologically salient units improve the model's predictions? Results indicate that, with substantial adaptations, XLSR-53 will be useful for this task, and that preprocessing to combine grapheme clusters does improve model performance.
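The grapheme-cluster preprocessing mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's own code: the abstract does not describe the Northern Prinmi orthography, so the sketch assumes that tones are written as runs of Chao tone letters and that diacritics are Unicode combining marks. The names TONE_LETTERS and to_units are hypothetical.

```python
import unicodedata

# Assumption: tones are written as runs of Chao tone letters. The actual
# Northern Prinmi transcription conventions are not given in the abstract.
TONE_LETTERS = set("\u02e5\u02e6\u02e7\u02e8\u02e9")


def to_units(text):
    """Split a transcript into units, merging grapheme clusters.

    A combining diacritic is attached to the unit before it, and
    consecutive tone letters are merged into a single tone unit.
    """
    units = []
    # NFD separates precomposed characters into base + combining marks.
    for ch in unicodedata.normalize("NFD", text):
        is_mark = unicodedata.combining(ch) != 0
        is_tone = ch in TONE_LETTERS
        if units and is_mark:
            units[-1] += ch  # diacritic joins its base character
        elif units and is_tone and units[-1][-1] in TONE_LETTERS:
            units[-1] += ch  # extend a multi-character tone
        else:
            units.append(ch)
    return units
```

With this kind of preprocessing, a nasalized vowel followed by a two-letter falling tone becomes two output symbols for the model instead of four separate characters.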
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Most speech and language technologies are trained with massive amounts of
speech and text information. However, most of the world's languages do not have
such resources or stable orthography. Systems constructed under these almost
zero resource conditions are not only promising for speech technology but also
for computational language documentation. The goal of computational language
documentation is to help field linguists to (semi-)automatically analyze and
annotate audio recordings of endangered and unwritten languages. Example tasks
are automatic phoneme discovery or lexicon discovery from the speech signal.
This paper presents a speech corpus collected during a realistic language
documentation process. It is made up of 5k speech utterances in Mboshi (Bantu
C25) aligned to French text translations. Speech transcriptions are also made
available: they correspond to a non-standard graphemic form close to the
language phonology. We present how the data was collected, cleaned and
processed and we illustrate its use through a zero-resource task: spoken term
discovery. The dataset is made available to the community for reproducible
computational language documentation experiments and their evaluation. Comment: accepted to LREC 201
Universal Automatic Phonetic Transcription into the International Phonetic Alphabet
This paper presents a state-of-the-art model for transcribing speech in any
language into the International Phonetic Alphabet (IPA). Transcription of
spoken languages into IPA is an essential yet time-consuming process in
language documentation, and even partially automating this process has the
potential to drastically speed up the documentation of endangered languages.
Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is
based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use
training data from seven languages from CommonVoice 11.0, transcribed into IPA
semi-automatically. Although this training dataset is much smaller than
Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or
better results. Furthermore, we show that the quality of our universal
speech-to-IPA models is close to that of human annotators. Comment: 5 pages, 7 tables
Phonetic lessons from automatic phonemic transcription: preliminary reflections on Na (Sino-Tibetan) and Tsuut’ina (Dene) data
Automatic phonemic transcription tools now reach high levels of accuracy on a single speaker with relatively small amounts of training data: on the order of 100 to 250 minutes of transcribed speech. Beyond its practical usefulness for language documentation, use of automatic transcription also yields some insights for phoneticians. The present report illustrates this by going into qualitative error analysis on two test cases, Yongning Na (Sino-Tibetan) and Tsuut’ina (Dene). Among other benefits, error analysis allows for a renewed exploration of phonetic detail: examining the output of phonemic transcription software compared with spectrographic and aural evidence. From a methodological point of view, the present report is intended as a case study in Computational Language Documentation: an interdisciplinary approach that associates fieldworkers (“diversity linguists”) and computer scientists with phoneticians/phonologists.
Integrating Automatic Transcription into the Language Documentation Workflow: Experiments with Na Data and the Persephone Toolkit
Automatic speech recognition tools have potential for facilitating language documentation, but in practice these tools remain little-used by linguists for a variety of reasons, such as that the technology is still new (and evolving rapidly), user-friendly interfaces are still under development, and case studies demonstrating the practical usefulness of automatic recognition in a low-resource setting remain few. This article reports on a success story in integrating automatic transcription into the language documentation workflow, specifically for Yongning Na, a language of Southwest China. Using Persephone, an open-source toolkit, a single-speaker speech transcription tool was trained over five hours of manually transcribed speech. The experiments found that this method can achieve a remarkably low error rate (on the order of 17%), and that automatic transcriptions were useful as a canvas for the linguist. The present report is intended for linguists with little or no knowledge of speech processing. It aims to provide insights into (i) the way the tool operates and (ii) the process of collaborating with natural language processing specialists. Practical recommendations are offered on how to anticipate the requirements of this type of technology from the early stages of data collection in the field. National Foreign Language Resource Center
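For readers unfamiliar with how an error rate such as the 17% figure above is usually obtained, the following is a minimal sketch, assuming the common definition: edit distance between reference and predicted symbol sequences, divided by the reference length. The abstract does not state the exact unit or tool used, and the function names edit_distance and error_rate are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference symbol
                curr[j - 1] + 1,         # insertion of a spurious symbol
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

The inputs can be lists of phonemes, characters, or merged grapheme units, depending on which unit the evaluation is meant to reflect.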
Open-vocabulary keyword spotting in any language through multilingual contrastive speech-phoneme pretraining
In this paper, we introduce a massively multilingual speech corpus with
fine-grained phonemic transcriptions, encompassing more than 115 languages from
diverse language families. Based on this multilingual dataset, we propose
CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of
open-vocabulary matching between speech signals and phonemically transcribed
keywords or arbitrary phrases. The proposed model has been tested on two
fieldwork speech corpora in 97 unseen languages, exhibiting strong
generalizability across languages. Comparison with a text-based model shows
that using phonemes as modeling units enables much better crosslinguistic
generalization than orthographic texts. Comment: Preprint; Work in Progress
Towards Zero-shot Learning for Automatic Phonemic Transcription
Automatic phonemic transcription tools are useful for low-resource language
documentation. However, due to the lack of training sets, only a tiny fraction
of languages have phonemic transcription tools. Fortunately, multilingual
acoustic modeling provides a solution given limited audio training data. A more
challenging problem is to build phonemic transcribers for languages with zero
training data. The difficulty of this task is that phoneme inventories often
differ between the training languages and the target language, making it
infeasible to recognize unseen phonemes. In this work, we address this problem
by adopting the idea of zero-shot learning. Our model is able to recognize
unseen phonemes in the target language without any training data. In our model,
we decompose phonemes into corresponding articulatory attributes such as vowel
and consonant. Instead of predicting phonemes directly, we first predict
distributions over articulatory attributes, and then compute phoneme
distributions with a customized acoustic model. We evaluate our model by
training it using 13 languages and testing it using 7 unseen languages. We find
that it achieves 7.7% better phoneme error rate on average over a standard
multilingual model. Comment: AAAI 202
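The idea of scoring phonemes through articulatory attributes can be sketched as follows. This is an illustration only, not the paper's model: the abstract does not specify how phoneme distributions are computed from attribute predictions. The sketch assumes a simple scheme in which each phoneme is described by a set of attributes, its score is the sum of matching attribute logits minus the non-matching ones, and a softmax turns scores into a distribution. ATTRIBUTES, SIGNATURES, and phoneme_distribution are hypothetical names, and the toy inventory is invented for illustration.

```python
import math

# Hypothetical toy attribute inventory and phoneme signatures.
ATTRIBUTES = ["vowel", "consonant", "voiced", "bilabial", "nasal"]
SIGNATURES = {
    "a": {"vowel", "voiced"},
    "p": {"consonant", "bilabial"},
    "b": {"consonant", "bilabial", "voiced"},
    "m": {"consonant", "bilabial", "voiced", "nasal"},
}


def phoneme_distribution(attr_logits, signatures=SIGNATURES):
    """Map per-attribute logits for one frame to phoneme probabilities.

    A phoneme gains the logit of every attribute it has and loses the
    logit of every attribute it lacks. A phoneme absent from training
    can still be scored as long as its attribute signature is known.
    """
    scores = {
        phoneme: sum(
            attr_logits[a] if a in attrs else -attr_logits[a]
            for a in ATTRIBUTES
        )
        for phoneme, attrs in signatures.items()
    }
    # Numerically stable softmax over phoneme scores.
    top = max(scores.values())
    exps = {p: math.exp(s - top) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}
```

Because the target language's inventory enters only through the signature table, swapping in a different table is enough to score phonemes that never appeared in the training languages.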