XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese
This research paper focuses on the development and evaluation of Automatic
Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims
to improve ASR performance in converting spoken language into written text,
specifically for Indonesian, Javanese, and Sundanese languages. The paper
discusses the testing procedures, datasets used, and methodology employed in
training and evaluating the ASR systems. The results show that the XLS-R 300m
model achieves competitive Word Error Rate (WER) measurements, with a slight
compromise in performance for Javanese and Sundanese languages. The integration
of a 5-gram KenLM language model significantly reduces WER and enhances ASR
accuracy. The research contributes to the advancement of ASR technology by
addressing linguistic diversity and improving performance across various
languages. The findings provide insights into optimizing ASR accuracy and
applicability for diverse linguistic contexts.
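Since the results above are reported as Word Error Rate (WER) measurements, a minimal sketch of how WER is conventionally computed may be useful: the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. The function name and the example strings are illustrative, not taken from the paper's test sets.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("saya pergi ke pasar", "saya pergi pasar"))  # one deletion -> 0.25
```

Rescoring such hypotheses with an n-gram language model (as with the 5-gram KenLM mentioned above) lowers WER by favouring word sequences that are probable under the language model.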
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
language tree, Indo-European. However, due to poverty in both linguistic and
economic capital, Sinhala, in the perspective of Natural Language Processing
tools and research, remains a resource-poor language which has neither the
economic drive its cousin English has nor the sheer push of the law of numbers
a language such as Chinese has. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap of a comprehensive literature survey of the
publicly available Sinhala natural language tools and research so that the
researchers working in this field can better utilize contributions of their
peers. As such, we shall be uploading this paper to arXiv and updating it
periodically to reflect the advances made in the field.
Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity
GAN vocoders are currently one of the state-of-the-art methods for building
high-quality neural waveform generative models. However, most of their
architectures require tens of billions of floating-point operations per second
(GFLOPS) to generate speech waveforms in a samplewise manner. This makes GAN
vocoders still challenging to run on normal CPUs without accelerators or
parallel computers. In this work, we propose a new architecture for GAN
vocoders that mainly depends on recurrent and fully-connected networks to
directly generate the time-domain signal in a framewise manner. This results in
a considerable reduction in computational cost and enables very fast
generation on both GPUs and low-complexity CPUs. Experimental results show that
our Framewise WaveGAN vocoder achieves significantly higher quality than
auto-regressive maximum-likelihood vocoders such as LPCNet at a very low
complexity of 1.2 GFLOPS. This makes GAN vocoders more practical on edge and
low-power devices.
Comment: Submitted to ICASSP 2023, demo:
https://ahmed-fau.github.io/fwgan_demo
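The complexity argument above can be made concrete with back-of-the-envelope arithmetic: a samplewise vocoder runs its network once per output sample, while a framewise one runs once per frame. The per-call FLOP counts below are assumed round numbers for illustration, not measurements of the actual architectures in the paper.

```python
# Illustrative complexity comparison: samplewise vs framewise generation.
SAMPLE_RATE = 16_000   # output samples per second
FRAME_SIZE = 160       # samples per frame (10 ms at 16 kHz)

# Samplewise: one network call per sample (assumed per-call cost).
flops_per_sample = 1_500_000
samplewise_gflops = flops_per_sample * SAMPLE_RATE / 1e9

# Framewise: one network call per 160-sample frame (assumed per-call cost).
flops_per_frame = 12_000_000
framewise_gflops = flops_per_frame * (SAMPLE_RATE / FRAME_SIZE) / 1e9

print(f"samplewise: {samplewise_gflops:.1f} GFLOPS")  # 24.0 GFLOPS
print(f"framewise:  {framewise_gflops:.1f} GFLOPS")   # 1.2 GFLOPS
```

Even with a much larger network per call, amortising the cost over a whole frame brings the sustained rate down by an order of magnitude, which is the essence of the framewise design.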
NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping
Speech codec enhancement methods are designed to remove distortions added by
speech codecs. While classical methods are very low in complexity and add zero
delay, their effectiveness is rather limited. Compared to that, DNN-based
methods deliver higher quality but they are typically high in complexity and/or
require delay. The recently proposed Linear Adaptive Coding Enhancer (LACE)
addresses this problem by combining DNNs with classical long-term/short-term
postfiltering, resulting in a causal low-complexity model. A shortcoming of the
LACE model is, however, that quality quickly saturates when the model size is
scaled up. To mitigate this problem, we propose a novel adaptive temporal
shaping module that adds high temporal resolution to the LACE model, resulting
in the Non-Linear Adaptive Coding Enhancer (NoLACE). We adapt NoLACE to enhance
the Opus codec and show that NoLACE significantly outperforms both the Opus
baseline and an enlarged LACE model at 6, 9 and 12 kb/s. We also show that LACE
and NoLACE are well-behaved when used with an ASR system.
Comment: submitted to ICASSP 202
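For readers unfamiliar with the classical long-term postfiltering that LACE combines with DNNs, a minimal sketch follows. It applies a pitch-lag comb filter that emphasises harmonics and attenuates coding noise between them; the fixed gain and the toy signal are illustrative, whereas LACE/NoLACE predict such filter behaviour adaptively with a DNN.

```python
def long_term_postfilter(x, pitch_lag, gain=0.25):
    """Classical long-term (pitch) comb postfilter:
    y[n] = (x[n] + gain * x[n - pitch_lag]) / (1 + gain).
    Reinforces the signal at multiples of the pitch period."""
    y = []
    for n, sample in enumerate(x):
        delayed = x[n - pitch_lag] if n >= pitch_lag else 0.0
        y.append((sample + gain * delayed) / (1.0 + gain))
    return y

# Toy periodic signal with period 3: pulses are reinforced,
# while uncorrelated (noise-like) samples are attenuated.
signal = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
print(long_term_postfilter(signal, pitch_lag=3))
```

The causal, zero-delay nature of this filter is what lets LACE stay low-complexity and delay-free while a DNN supplies the adaptivity that the fixed-gain classical version lacks.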
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken.
Pronunciation modelling in end-to-end text-to-speech synthesis
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve
high-quality naturalness scores without extensive processing of the text input. Since S2S
models have been proposed for multiple stages of the TTS pipeline, the field has moved
toward collapsing the pipeline into End-to-End (E2E) TTS, where a waveform
is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS
in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation
(lexicon-lookup and/or G2P modelling) could be implicitly learnt in a text-encoder
during training. The benefits of a learned text encoding include improved modelling
of phonetic context, which makes the contextual linguistic features traditionally used in
TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar
naturalness scores with text- or phone-input (e.g. as in [4]). Successful modelling
of phonetic context has led some to question the benefit of using phone- instead of
text-input altogether (see [5]).
The use of text-input brings into question the value of the pronunciation lexicon
in E2E-TTS. Without phone-input, an S2S encoder learns an implicit grapheme-to-phoneme
(G2P) model from text-audio pairs during training. With common datasets
for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates
compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation
is difficult for some words (e.g. foreign words and proper names) since
the knowledge to disambiguate their pronunciations may not be provided by the local
grapheme context and may require knowledge beyond that contained in sentence-level
text-audio sequences. When test stimuli were selected according to G2P difficulty,
increased mispronunciations in E2E-TTS with text-input were observed. Following
the proposed benefits of subword decomposition in S2S modelling in other language
tasks (e.g. neural machine translation), the effects of morphological decomposition
on pronunciation modelling were investigated. Learning of the French post-lexical
phenomenon liaison was also evaluated.
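The lexicon-lookup and G2P-fallback phonetisation contrasted with implicit G2P above can be sketched as follows. The toy lexicon entries, the ARPAbet-style phone labels, and the naive letter-to-sound fallback are all illustrative, standing in for a real pronunciation dictionary and a trained G2P model.

```python
# Toy lexicon-lookup phonetisation with a rule-based fallback.
LEXICON = {
    # Words whose pronunciation the grapheme string alone cannot predict:
    "colonel": ["K", "ER", "N", "AH", "L"],
    "choir": ["K", "W", "AY", "ER"],
}

# Naive one-letter-to-one-phone rules (stand-in for a trained G2P model).
FALLBACK = {"c": "K", "a": "AE", "t": "T"}

def phonetise(word):
    """Return the lexicon pronunciation if the word is listed,
    otherwise fall back to per-letter rules."""
    if word in LEXICON:
        return LEXICON[word]
    return [FALLBACK[ch] for ch in word if ch in FALLBACK]

print(phonetise("colonel"))  # from lexicon: ['K', 'ER', 'N', 'AH', 'L']
print(phonetise("cat"))      # from fallback: ['K', 'AE', 'T']
```

Words like "colonel" are exactly the cases the thesis highlights: no local grapheme context recovers the pronunciation, so an implicit G2P model trained only on sentence-level text-audio pairs has little chance of generalising to them, whereas a lexicon states the answer outright.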
With the goal of an inexpensive, large-scale evaluation of pronunciation modelling,
the reliability of automatic speech recognition (ASR) to measure TTS intelligibility
was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge
was conducted. ASR reliably found similar significant differences between systems
as paid listeners in controlled conditions in English. An analysis of transcriptions for
words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR
Transformer model used was found to be unreliable on words with difficult G2P
relations, producing homophonic or otherwise incorrect transcriptions of those
words. A further evaluation of representation mixing in Tacotron finds
pronunciation correction is possible when mixing text- and phone-inputs. The thesis
concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a
pronunciation guide, since it can provide assurances that G2P generalisation cannot.