Data augmentation for low resource languages
Recently there has been interest in approaches for training speech recognition systems for languages with limited resources. Under the IARPA Babel program such resources have been provided for a range of languages to support this research area. This paper examines a particular form of approach, data augmentation, that can be applied to these situations. Data augmentation schemes aim to increase the quantity of data available to train the system, for example semi-supervised training, multilingual processing, acoustic data perturbation and speech synthesis. To date the majority of work has considered individual data augmentation schemes, with few consistent performance contrasts or examination of whether the schemes are complementary. In this work two data augmentation schemes, semi-supervised training and vocal tract length perturbation, are examined and combined on the Babel limited language pack configuration. Here only about 10 hours of transcribed acoustic data are available. Two languages are examined, Assamese and Zulu, which were found to be the most challenging of the Babel languages released for the 2014 Evaluation. For both languages consistent speech recognition performance gains can be obtained using these augmentation schemes. Furthermore, the impact of these performance gains on a downstream keyword spotting task is also described. Index Terms: data augmentation, speech recognition, Babel
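Vocal tract length perturbation, one of the two schemes combined above, warps the frequency axis of each utterance by a random factor. A minimal sketch of the commonly used piecewise-linear warp, assuming a 16 kHz signal (8 kHz Nyquist) and a 4800 Hz warp boundary; the function name and defaults are illustrative, not the paper's implementation:

```python
def vtlp_warp(f, alpha, f_hi=4800.0, f_max=8000.0):
    """Piecewise-linear VTLP frequency warp.

    f     : frequency in Hz to be remapped
    alpha : warp factor, typically drawn uniformly from [0.9, 1.1]
    Below the boundary, frequencies are scaled by alpha; above it, a
    linear segment maps the remainder onto [.., f_max] so that the
    Nyquist frequency stays fixed.
    """
    boundary = f_hi * min(alpha, 1.0) / alpha
    if f <= boundary:
        return alpha * f
    # Linear segment from (boundary, alpha * boundary) to (f_max, f_max).
    return f_max - (f_max - f_hi * min(alpha, 1.0)) * (f_max - f) / (f_max - boundary)
```

Applying this map to the centre frequencies of the mel filterbank (one random `alpha` per utterance) yields a perturbed copy of the training data at negligible cost.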
Data augmentation for automatic speech recognition for low resource languages
In this thesis, we explore several novel data augmentation methods for improving the performance of automatic speech recognition (ASR) on low-resource languages. Using a 100-hour subset of English LibriSpeech to simulate a low-resource setting, we compare the well-known SpecAugment augmentation approach to these new methods, along with several other competitive baselines. We then apply the most promising combinations of models and augmentation methods to three genuinely under-resourced languages, using the 40-hour Gujarati, Tamil, and Telugu datasets from the 2021 Interspeech Low Resource Automatic Speech Recognition Challenge for Indian Languages. Our data augmentation approaches, coupled with state-of-the-art acoustic model architectures and language models, yield reductions in word error rate over SpecAugment and other competitive baselines for the LibriSpeech-100 dataset, showing a particular advantage over prior models for the "other", more challenging, dev and test sets. Extending this work to the low-resource Indian languages, we see large improvements over the baseline models and results comparable to large multilingual models.
Data Augmentation via Dependency Tree Morphing for Low Resource Languages
Neural NLP systems achieve high scores in the presence of sizable training
datasets. Lack of such datasets leads to poor system performance in the case of
low-resource languages. We present two simple text augmentation techniques
using dependency trees, inspired by image processing. We crop sentences by
removing dependency links, and we rotate sentences by moving tree fragments
around the root. We apply these techniques to augment the training sets of
low-resource languages in the Universal Dependencies project. We implement a
character-level sequence tagging model and evaluate the augmented datasets on a
part-of-speech tagging task. We show that crop and rotate provide improvements
over models trained with non-augmented data for the majority of the languages,
especially for languages with rich case-marking systems.
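The crop operation can be illustrated on a toy dependency tree: keep the root word plus the subtree of one chosen dependent, yielding a shorter but still grammatical fragment. The token format and helper names below are invented for illustration:

```python
# Toy CoNLL-style sentence: "the cat chased a mouse".
# Each token records (id, surface form, head id, dependency relation);
# head 0 marks the root.
TOKENS = [
    {"id": 1, "form": "the",    "head": 2, "rel": "det"},
    {"id": 2, "form": "cat",    "head": 3, "rel": "nsubj"},
    {"id": 3, "form": "chased", "head": 0, "rel": "root"},
    {"id": 4, "form": "a",      "head": 5, "rel": "det"},
    {"id": 5, "form": "mouse",  "head": 3, "rel": "obj"},
]

def subtree(tokens, head_id):
    """Collect the ids of all tokens in the subtree rooted at head_id."""
    ids, changed = {head_id}, True
    while changed:
        changed = False
        for t in tokens:
            if t["head"] in ids and t["id"] not in ids:
                ids.add(t["id"])
                changed = True
    return ids

def crop(tokens, keep_rel):
    """Crop the sentence: keep the root plus one dependent's full subtree."""
    root = next(t for t in tokens if t["head"] == 0)
    dep = next(t for t in tokens if t["head"] == root["id"] and t["rel"] == keep_rel)
    keep = subtree(tokens, dep["id"]) | {root["id"]}
    return [t["form"] for t in tokens if t["id"] in keep]
```

Cropping on `nsubj` yields "the cat chased", on `obj` "chased a mouse"; rotation would analogously reorder whole subtrees around the root.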
Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models
Self-supervised representation learning (SSRL) has improved performance
on downstream phoneme recognition compared to supervised models. Training SSRL
models requires a large amount of pre-training data, and this poses a challenge
for low-resource languages. A common approach is transferring knowledge from
other languages. Instead, we propose to use audio augmentation to pre-train
SSRL models in a low-resource condition and evaluate phoneme recognition as the
downstream task. We performed a systematic comparison of augmentation
techniques, namely pitch variation, noise addition, accented target-language
speech, and other-language speech. We found that combined augmentation (noise/pitch)
was the best augmentation strategy, outperforming accent and language knowledge
transfer. We compared the performance with various quantities and types of
pre-training data. We examined the scaling factor of augmented data to achieve
equivalent performance to models pre-trained with target-domain speech. Our
findings suggest that for resource-constrained languages, in-domain synthetic
augmentation can outperform knowledge transfer from accented or other-language
speech.
Comment: 5 pages, 4 figures, ICASSP2
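Of the augmentations compared above, noise addition is the simplest to sketch: scale Gaussian noise so that it sits at a chosen signal-to-noise ratio below the speech. A stdlib-only illustration (the function name and list-of-floats signal format are assumptions for the sketch, not the paper's pipeline):

```python
import math
import random

def add_noise(signal, snr_db, seed=0):
    """Mix Gaussian white noise into a waveform at a target SNR in dB.

    signal : list of float samples
    snr_db : desired signal-to-noise ratio, e.g. 20.0
    """
    rng = random.Random(seed)
    power = sum(s * s for s in signal) / len(signal)        # mean signal power
    noise_power = power / (10 ** (snr_db / 10.0))           # power at target SNR
    sigma = math.sqrt(noise_power)                          # per-sample std dev
    return [s + rng.gauss(0.0, sigma) for s in signal]
```

Pitch variation is usually done by resampling or phase-vocoder shifting and needs a DSP library, so it is omitted here.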
IndiText Boost: Text Augmentation for Low Resource India Languages
Text augmentation is an important task for low-resource languages: it helps
deal with the problem of data scarcity. Over the years, much work has been
done on data augmentation for English; in contrast, far less work has been done
on Indian languages, even though data augmentation is precisely the tool for
addressing data scarcity. In this work, we implement techniques such as Easy
Data Augmentation, Back Translation, Paraphrasing, Text Generation using LLMs,
and Text Expansion using LLMs for text classification in different languages.
We focus on six Indian languages, namely Sindhi, Marathi, Hindi, Gujarati,
Telugu, and Sanskrit. To the best of our knowledge, no such work exists on
text augmentation for Indian languages. We carry out binary as well as
multi-class text classification to make our results more comparable. We obtain
the surprising result that basic data augmentation techniques surpass LLMs.
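The "basic" techniques that the abstract finds surprisingly strong include the dictionary-free Easy Data Augmentation operations, which need no language resources at all. A minimal sketch of two of them, random swap and random deletion (function names are illustrative):

```python
import random

def random_swap(words, n_swaps, rng):
    """EDA-style random swap: exchange two random positions n_swaps times."""
    out = words[:]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    """EDA-style random deletion: drop each word with probability p,
    always keeping at least one word."""
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]
```

Because neither operation consults a synonym dictionary, both transfer directly to languages like Sindhi or Sanskrit where such resources are scarce.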
Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks (English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, English to Danish and Swedish) and one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoding, and subword sampling.
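Subword sampling regularizes the model by exposing it to several segmentations of the same word during training. A toy enumerate-and-sample sketch over a small merge vocabulary; real implementations (e.g. BPE-dropout or SentencePiece's sampled encoding) weight segmentations by probability rather than sampling uniformly, and the names here are illustrative:

```python
import random

def segmentations(word, vocab):
    """All ways to split `word` into pieces from `vocab`.
    Single characters are always allowed as fallback pieces."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab or i == 1:
            out += [[piece] + rest for rest in segmentations(word[i:], vocab)]
    return out

def sample_segmentation(word, vocab, rng):
    """Draw one segmentation at random for this training example."""
    return rng.choice(segmentations(word, vocab))
```

With the vocabulary {"lo", "low", "er", "we"}, "lower" can surface as ["low", "er"], ["lo", "we", "r"], or fully character-split, so the model sees varied subword contexts for the same surface form.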
Distributional Data Augmentation Methods for Low Resource Language
Text augmentation is a technique for constructing synthetic data from an
under-resourced corpus to improve predictive performance. Synthetic data
generation is common in numerous domains, but text augmentation has recently
emerged in natural language processing (NLP) to improve downstream tasks.
One of the current state-of-the-art text augmentation techniques is easy data
augmentation (EDA), which augments the training data by injecting and replacing
synonyms and randomly permuting sentences. One major obstacle with EDA is the
need for versatile and complete synonym dictionaries, which cannot easily be
found in low-resource languages. To improve the utility of EDA, we propose two
extensions, easy distributional data augmentation (EDDA) and type-specific
similar word replacement (TSSR), which use semantic word context information
and part-of-speech tags for word replacement and augmentation. In an extensive
empirical evaluation, we show the utility of the proposed methods, measured by
F1 score, on two representative datasets in Swedish, as an example of a
low-resource language. With the proposed methods, we show that augmented data
improves classification performance in low-resource settings.
Comment: AAAI 2023 Workshop on Knowledge Augmented Methods for NL
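The TSSR idea, replacing a word with a distributionally similar word that shares its part-of-speech tag, can be sketched with a hand-made embedding table. The table, tags, and function names below are invented for illustration and are not the paper's implementation:

```python
import math
import random

# Toy 2-d "distributional" vectors and POS tags (illustrative data).
EMB = {"cat": (1.0, 0.1), "dog": (0.9, 0.2), "run": (0.1, 1.0), "walk": (0.2, 0.9)}
POS = {"cat": "NOUN", "dog": "NOUN", "run": "VERB", "walk": "VERB"}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tssr_replace(tokens, i, k=1, seed=0):
    """Replace tokens[i] with one of its k nearest same-POS neighbours."""
    word = tokens[i]
    if word not in EMB:
        return tokens                      # unknown word: leave untouched
    cands = [w for w in EMB if w != word and POS[w] == POS[word]]
    if not cands:
        return tokens
    cands.sort(key=lambda w: cosine(EMB[word], EMB[w]), reverse=True)
    out = tokens[:]
    out[i] = random.Random(seed).choice(cands[:k])
    return out
```

The POS constraint is what keeps the augmented sentence grammatical: a noun is only ever swapped for another noun, which matters for a morphologically rich language like Swedish.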
Cross-lingual Data Augmentation for Document-grounded Dialog Systems in Low Resource Languages
This paper proposes a framework to address the issue of data scarcity in
Document-Grounded Dialogue Systems (DGDS). Our model leverages high-resource
languages to enhance the capability of dialogue generation in low-resource
languages. Specifically, we present a novel pipeline, CLEM (Cross-Lingual
Enhanced Model), comprising adversarially trained retrieval (retriever and
re-ranker) and a FiD (fusion-in-decoder) generator. To further leverage
high-resource languages, we also propose an innovative architecture to conduct
alignment across different languages with translated training. Extensive
experimental results demonstrate the effectiveness of our model, and we
achieved 4th place in the DialDoc 2023 Competition. Therefore, CLEM can serve
as a solution to resource scarcity in DGDS and provide useful guidance for
multilingual alignment tasks.
Making more of little data: Improving low-resource automatic speech recognition using data augmentation
The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. Unfortunately, for low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques could help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which there was a pre-existing text-to-speech (TTS) system available, we also examined the use of TTS to generate ASR training data from text-only sources. We find that using a self-training approach consistently yields improved performance (a relative WER reduction of up to 20.5% compared to using an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings was even stronger (up to 25.5% relative reduction in WER compared to a system based on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or (if possible) TTS-generated data as an efficient solution to overcome the limitations of data availability for resource-scarce languages in order to improve ASR performance.
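The self-training recipe described above can be sketched as a single round of pseudo-labelling. Here `train_asr` and `transcribe` are placeholder callables standing in for a real ASR toolkit (they are assumptions, not an actual API); the structure of the loop is the point:

```python
def self_train(train_asr, transcribe, labelled, unlabelled, confidence=0.9):
    """One round of self-training for ASR.

    train_asr(pairs)        -> model, trained on (audio, text) pairs
    transcribe(model, wav)  -> (hypothesis text, confidence score)
    labelled                -> list of (audio, text) human transcriptions
    unlabelled              -> list of untranscribed audio
    """
    model = train_asr(labelled)                 # seed model on human labels
    pseudo = []
    for audio in unlabelled:
        text, score = transcribe(model, audio)
        if score >= confidence:                 # keep only confident hypotheses
            pseudo.append((audio, text))
    return train_asr(labelled + pseudo)         # retrain on the union
```

In practice this round can be repeated, with the confidence threshold controlling the trade-off between pseudo-label quantity and transcription noise.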
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
Recent years have seen increased interest in the computational speech
processing of Maltese, but resources remain sparse. In this paper, we consider
data augmentation techniques for improving speech recognition for low-resource
languages, focusing on Maltese as a test case. We consider three different
types of data augmentation: unsupervised training, multilingual training, and
the use of synthesized speech as training data. The goal is to determine which
of these techniques, or which combination of them, is the most effective at
improving speech recognition for languages where the starting point is a small
corpus of approximately 7 hours of transcribed speech. Our results show that
combining the data augmentation techniques studied here leads to an absolute
WER improvement of 15% without the use of a language model.
Comment: 12 page
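Gains throughout these abstracts are measured as word error rate (WER): the word-level edit distance between a reference transcript and the system hypothesis, divided by the reference length. A minimal stdlib sketch:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that an "absolute improvement of 15%" subtracts percentage points (e.g. 50% down to 35%), whereas the relative reductions quoted elsewhere divide by the baseline WER.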