On the Effectiveness of Neural Text Generation based Data Augmentation for Recognition of Morphologically Rich Speech
Advanced neural network models have penetrated Automatic Speech Recognition (ASR) in recent years; in language modeling, however, many systems still rely partly or entirely on traditional Back-off N-gram Language Models (BNLM). The reason for this is the high cost and complexity of training and using neural language models, which are typically only feasible as a second decoding pass (rescoring). In our recent work we significantly improved the online performance of a conversational speech transcription system by transferring knowledge from a Recurrent Neural Network Language Model (RNNLM) to the single-pass BNLM with text generation based data augmentation. In the present paper we analyze the amount of transferable knowledge and demonstrate that the neural augmented LM (RNN-BNLM) can capture almost 50% of the knowledge of the RNNLM while dropping the second decoding pass and making the system real-time capable. We also systematically compare word and subword LMs and show that subword-based neural text augmentation can be especially beneficial under under-resourced conditions. In addition, we show that by using the RNN-BNLM in the first pass followed by a neural second pass, offline ASR results can be improved even further.
Comment: 8 pages, 2 figures, accepted for publication at TSD 202
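As a rough illustration of the augmentation idea (not the authors' exact pipeline): sample synthetic sentences from the neural LM, pool them with the original corpus, and estimate a back-off n-gram model on the union. The sketch below uses NLTK's StupidBackoff as a stand-in back-off LM, and a stubbed-out generator (the hypothetical sample_from_rnnlm) in place of a trained RNNLM.

```python
from nltk.lm import StupidBackoff
from nltk.lm.preprocessing import padded_everygram_pipeline

# Original (small) training corpus, one tokenized sentence per item.
corpus = [
    "hello how are you".split(),
    "how are you today".split(),
]

def sample_from_rnnlm(n_sentences):
    # Hypothetical stand-in for sampling from a trained RNNLM.
    return ["hello how are you today".split() for _ in range(n_sentences)]

# Pool real and generated sentences, then fit a back-off n-gram LM.
augmented = corpus + sample_from_rnnlm(100)
order = 3
train, vocab = padded_everygram_pipeline(order, augmented)
lm = StupidBackoff(alpha=0.4, order=order)
lm.fit(train, vocab)

print(lm.score("you", ["how", "are"]))  # P(you | how are) under back-off
```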
Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification
The performance of learning models heavily relies on the availability and
adequacy of training data. To address the dataset adequacy issue, researchers
have extensively explored data augmentation (DA) as a promising approach. DA
generates new data instances through transformations applied to the available
data, thereby increasing dataset size and variability. This approach has
enhanced model performance and accuracy, particularly in addressing class
imbalance problems in classification tasks. However, few studies have explored DA for the Arabic language, and those rely on traditional approaches such as paraphrasing or noising-based techniques. In this paper, we propose a new Arabic DA method that employs a recent powerful modeling technique, AraGPT-2, for the augmentation process. The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, and Jaccard distances and the BLEU score. Finally, the AraBERT transformer is used on sentiment classification tasks to evaluate the classification performance of the augmented Arabic datasets. The experiments were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and MOVIE. The selected datasets vary in size, number of labels, and degree of class imbalance. The results show that the proposed methodology enhanced Arabic sentiment text classification on all datasets, increasing the F1 score by 4% on AraSarcasm, 6% on ASTD, 9% on ATT, and 13% on MOVIE.
Comment: 15 pages, 16 figures; this work has been submitted to the IEEE Access journal for possible publication
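A hedged sketch of the evaluation step described above, using English stand-ins for the Arabic sentences: a generated sentence is compared against its source with Euclidean and cosine distances over bag-of-words vectors, token-level Jaccard similarity, and a smoothed BLEU score. The bag-of-words representation is an illustrative choice; the paper's exact features may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

source = "the film was wonderful and moving"
generated = "the movie was wonderful and touching"

# Bag-of-words vectors for the distance-based measures.
vec = CountVectorizer().fit([source, generated])
s, g = vec.transform([source]), vec.transform([generated])

euclid = euclidean_distances(s, g)[0, 0]
cosine = cosine_similarity(s, g)[0, 0]

# Jaccard similarity over token sets.
src_tokens, gen_tokens = set(source.split()), set(generated.split())
jaccard = len(src_tokens & gen_tokens) / len(src_tokens | gen_tokens)

# Smoothed BLEU of the generated sentence against the source.
bleu = sentence_bleu([source.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)

print(f"euclidean={euclid:.2f} cosine={cosine:.2f} "
      f"jaccard={jaccard:.2f} bleu={bleu:.2f}")
```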
Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study
Code-switching (CSW) text generation has been receiving increasing attention
as a solution to address data scarcity. In light of this growing interest, we
need more comprehensive studies comparing different augmentation approaches. In
this work, we compare three popular approaches: lexical replacements,
linguistic theories, and back-translation (BT), in the context of Egyptian
Arabic-English CSW. We assess the effectiveness of the approaches on machine
translation and the quality of augmentations through human evaluation. We show
that BT and CSW predictive-based lexical replacement, being trained on CSW
parallel data, perform best on both tasks. Linguistic theories and random
lexical replacement prove effective in the absence of CSW parallel data, where both approaches achieve similar results.
Comment: Findings of EMNLP 202
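For a concrete picture of the random lexical replacement baseline, the sketch below swaps words in a monolingual sentence for dictionary translations with a fixed probability, yielding synthetic code-switched text. The toy English-to-Arabic (romanized) lexicon and the replacement probability are illustrative assumptions, not the paper's resources.

```python
import random

# Toy EN->AR (romanized) lexicon; illustrative only.
LEXICON = {"book": "kitab", "big": "kabir", "house": "beit"}

def random_lexical_replacement(tokens, lexicon, p=0.3, seed=0):
    # Swap each in-lexicon token for its translation with probability p.
    rng = random.Random(seed)
    return [lexicon[t] if t in lexicon and rng.random() < p else t
            for t in tokens]

print(random_lexical_replacement("the big house has one book".split(),
                                 LEXICON, p=0.5))
```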
Deepfake detection and low-resource language speech recognition using deep learning
While deep learning algorithms have made significant progress in automatic speech recognition and natural language processing, they require a significant amount of labelled training data to perform effectively. As such, these applications have not been extended to languages with only a limited amount of data available, such as extinct or endangered languages. Another problem caused by the rise of deep learning is that individuals with malicious intent have been able to leverage these algorithms to create fake content that can pose serious harm to security and public safety. In this work, we explore solutions to both of these problems. First, we investigate different data augmentation methods and acoustic architecture designs to improve automatic speech recognition performance on low-resource languages. Data augmentation for audio often involves changing the characteristics of the audio without modifying the ground truth; for example, different background noise can be added to an utterance while maintaining the content of the speech. We also explore how different acoustic model paradigms and levels of complexity affect performance on low-resource languages. These methods are evaluated on Seneca, an endangered language spoken by a Native American tribe, and Iban, a low-resource language spoken in Malaysia and Brunei. Second, we explore methods for speaker identification and audio spoofing detection. A spoofing attack involves using either a text-to-speech or a voice conversion application to generate audio that mimics the identity of a target speaker. These methods are evaluated on the ASVspoof 2019 Logical Access dataset, which contains audio generated using various methods of voice conversion and text-to-speech synthesis.
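The noise-based augmentation mentioned above can be sketched in a few lines: mix a background-noise signal into an utterance at a chosen signal-to-noise ratio, leaving the transcript untouched. This is a minimal NumPy version with synthetic signals; a real pipeline would load recorded waveforms (e.g. with librosa or torchaudio).

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio matches snr_db:
    # 10 * log10(P_speech / (scale**2 * P_noise)) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in utterance
noise = rng.normal(0, 0.1, 16000)                            # stand-in noise
augmented = add_noise_at_snr(speech, noise, snr_db=10.0)
```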
Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit
The primary focus of this thesis is to make Sanskrit manuscripts more
accessible to end-users through natural language technologies. The morphological richness, compounding, free word order, and low-resource
nature of Sanskrit pose significant challenges for developing deep learning
solutions. We identify four fundamental tasks, which are crucial for developing
a robust NLP technology for Sanskrit: word segmentation, dependency parsing,
compound type identification, and poetry analysis. The first task, Sanskrit
Word Segmentation (SWS), is a fundamental text processing task for any other
downstream applications. However, it is challenging due to the sandhi
phenomenon that modifies characters at word boundaries. Similarly, the existing
dependency parsing approaches struggle with morphologically rich and
low-resource languages like Sanskrit. Compound type identification is also
challenging for Sanskrit due to the context-sensitive semantic relation between
components. All these challenges result in sub-optimal performance in NLP
applications like question answering and machine translation. Finally, Sanskrit
poetry has not been extensively studied in computational linguistics.
While addressing these challenges, this thesis makes various contributions:
(1) The thesis proposes linguistically-informed neural architectures for these
tasks. (2) We showcase the interpretability and multilingual extension of the
proposed systems. (3) Our proposed systems report state-of-the-art performance.
(4) Finally, we present a neural toolkit named SanskritShala, a web-based
application that provides real-time analysis of input for various NLP tasks.
Overall, this thesis contributes to making Sanskrit manuscripts more accessible
by developing robust NLP technology and releasing various resources, datasets,
and a web-based toolkit.
Comment: Ph.D. dissertation
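To see why sandhi complicates Sanskrit word segmentation, note that characters fused at word boundaries must be undone before splits can even be proposed: na + asti surfaces as nāsti. The toy sketch below inverts two real vowel-sandhi rules (a + a -> ā, a + i -> e) to enumerate candidate splits; an actual SWS system would score such candidates with a learned model, and this rule table is deliberately minimal.

```python
# Inverse vowel-sandhi rules: surface char -> (end of left word, start of right).
INVERSE_SANDHI = {"ā": [("a", "a")], "e": [("a", "i")]}

def candidate_splits(surface):
    # Propose (left, right) word pairs by undoing one sandhi merge.
    for i, ch in enumerate(surface):
        for left_end, right_start in INVERSE_SANDHI.get(ch, []):
            yield surface[:i] + left_end, right_start + surface[i + 1:]

print(list(candidate_splits("nāsti")))  # includes ('na', 'asti'): na + asti -> nāsti
```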
Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks: English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, and English to Danish and Swedish; and on one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoders, and subword sampling.
Peer reviewed
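Of the techniques listed, subword sampling is easy to illustrate: with a unigram SentencePiece model, the same sentence receives a different segmentation on each pass, which acts as regularization and improves vocabulary coverage on the low-resource side. The sketch assumes a trained model file (here called unigram.model); the file name and the Finnish example sentence are illustrative.

```python
import sentencepiece as spm

# Assumes a unigram SentencePiece model was trained beforehand, e.g. via
# spm.SentencePieceTrainer.train(..., model_type="unigram").
sp = spm.SentencePieceProcessor(model_file="unigram.model")

sentence = "koira juoksee puistossa"  # Finnish: "the dog runs in the park"
for _ in range(3):
    # nbest_size=-1 samples over all segmentations; alpha controls smoothing.
    print(sp.encode(sentence, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```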
Crosslingual Retrieval Augmented In-context Learning for Bangla
The promise of Large Language Models (LLMs) in Natural Language Processing
has often been overshadowed by their limited performance in low-resource
languages such as Bangla. To address this, our paper presents a pioneering
approach that utilizes cross-lingual retrieval augmented in-context learning.
By strategically sourcing semantically similar prompts from a high-resource language, we enable multilingual pretrained language models (MPLMs), especially
the generative model BLOOMZ, to successfully boost performance on Bangla tasks.
Our extensive evaluation highlights that the cross-lingual retrieval augmented
prompts bring steady improvements to MPLMs over their zero-shot performance.
Comment: In the 1st Bangla Language Processing (BLP) Workshop, held in conjunction with the Conference on Empirical Methods in Natural Language Processing (EMNLP), December 202
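A minimal sketch of the retrieval step, under assumed components: labelled high-resource examples and the Bangla query are embedded in a shared multilingual space, and the most similar example is prepended to the prompt. The encoder name, the example pool, and the prompt template are illustrative choices, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

# An off-the-shelf multilingual encoder (illustrative choice).
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

pool = [  # labelled high-resource (English) examples
    ("The movie was fantastic.", "positive"),
    ("Terrible service, never again.", "negative"),
]
query = "সিনেমাটা দারুণ ছিল!"  # Bangla: "The movie was great!"

# Retrieve the semantically closest labelled example.
pool_emb = encoder.encode([text for text, _ in pool], convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
best = int(util.cos_sim(query_emb, pool_emb).argmax())

# Prepend it as an in-context demonstration for an MPLM such as BLOOMZ.
prompt = (f"Text: {pool[best][0]}\nSentiment: {pool[best][1]}\n"
          f"Text: {query}\nSentiment:")
print(prompt)
```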