A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture
Speech synthesis is the artificial production of human speech; a typical text-to-speech (TTS) system converts language text into a waveform. Many English TTS systems produce mature, natural, and human-like speech. In contrast, other languages, including Arabic, received little attention until recently. Existing Arabic speech synthesis solutions are slow, of low quality, and the naturalness of their synthesized speech is inferior to that of English synthesizers; they also lack essential speech factors such as intonation, stress, and rhythm. Various approaches have been proposed to address these issues, including concatenative methods such as unit selection and parametric methods, but they require substantial manual labour and domain expertise. Another reason for the poor performance of Arabic speech synthesizers is the scarcity of speech corpora, whereas English has many publicly available corpora and audiobooks. This work describes how to generate high-quality, natural, and human-like Arabic speech using an end-to-end deep neural network architecture. It uses only text-audio pairs, with a relatively small amount of recorded audio totalling 2.41 hours. It illustrates how to reuse English character embeddings despite taking diacritized Arabic characters as input, and how to preprocess the audio samples to achieve the best results.
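The embedding reuse described above is commonly done via a Buckwalter-style transliteration, which maps each Arabic letter and diacritic to a single ASCII character so that a character-embedding table designed for English symbols can consume Arabic input unchanged. The sketch below uses a small hypothetical subset of such a mapping for illustration, not the full transliteration table and not the authors' actual preprocessing code:

```python
# Minimal Buckwalter-style transliteration: each Arabic letter or diacritic
# maps to one ASCII character, so an English-style character-embedding layer
# can consume diacritized Arabic text directly. Subset for illustration only.
BUCKWALTER = {
    "\u0628": "b",  # baa
    "\u062A": "t",  # taa
    "\u0643": "k",  # kaaf
    "\u0645": "m",  # miim
    "\u064E": "a",  # fatha (short /a/)
    "\u064F": "u",  # damma (short /u/)
    "\u0650": "i",  # kasra (short /i/)
    "\u0651": "~",  # shadda (gemination mark)
}

def to_ascii(text: str) -> str:
    """Transliterate diacritized Arabic text to ASCII, leaving unmapped
    characters (e.g. spaces) as they are."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

# "kataba" (he wrote), written with full diacritics.
print(to_ascii("\u0643\u064E\u062A\u064E\u0628\u064E"))  # -> kataba
```

Because every diacritic becomes its own ASCII symbol, vowel quantity and gemination remain visible to the model as ordinary characters in the input sequence.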
DNN-Based Speech Synthesis for Arabic: Modelling and Evaluation
This paper investigates the use of deep neural networks (DNNs) for Arabic speech synthesis. In parametric speech synthesis, whether HMM-based or DNN-based, each speech segment is described with a set of contextual features corresponding to linguistic, phonetic, and prosodic information that may affect the pronunciation of the segment. Gemination and vowel quantity (short vowel vs. long vowel) are two particular and important phenomena in the Arabic language. Hence, it is worth investigating whether those phenomena must be handled with specific speech units, or whether their specification in the contextual features is enough. Consequently, four modelling approaches are evaluated, treating geminated consonants (respectively long vowels) either as fully-fledged phoneme units or as the same phoneme as their simple (respectively short) counterparts. Although no significant difference was observed in previous studies relying on HMM-based modelling, this paper examines these modelling variants in the framework of DNN-based speech synthesis. Listening tests are conducted to evaluate the four modelling approaches, and to assess the performance of DNN-based Arabic speech synthesis with respect to the previous HMM-based approach.
A prior case study of natural language processing on different domain
In the present state of the digital world, computers do not understand humans' ordinary language, which is a great barrier between humans and digital systems. Researchers have therefore developed advanced technology that delivers information to users from digital machines: natural language processing (NLP), a branch of AI with significant implications for the ways computers and humans can interact. NLP has become an essential technology for bridging the communication gap between humans and digital data. This study presents the necessity of NLP in the current computing world, along with different approaches and their applications. It also highlights the key challenges in the development of new NLP models.
ArTST: Arabic Text and Speech Transformer
We present ArTST, a pre-trained Arabic text and speech transformer for
supporting open-source speech technologies for the Arabic language. The model
architecture follows the unified-modal framework, SpeechT5, that was recently
released for English, and is focused on Modern Standard Arabic (MSA), with
plans to extend the model for dialectal and code-switched Arabic in future
editions. We pre-trained the model from scratch on MSA speech and text data,
and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR),
Text-To-Speech synthesis (TTS), and spoken dialect identification. In our
experiments comparing ArTST with SpeechT5, as well as with previously reported
results on these tasks, ArTST performs on par with or exceeds the current
state of the art in all three tasks. Moreover, we find that our pre-training is
conducive to generalization, which is particularly evident in the low-resource
TTS task. The pre-trained model, as well as the fine-tuned ASR and TTS models,
are released for research use. (Comment: 11 pages, 1 figure, SIGARAB ArabicNLP 202)
Automatic transcription and phonetic labelling of dyslexic children's reading in Bahasa Melayu
Automatic speech recognition (ASR) is potentially helpful for children who suffer from dyslexia. However, the highly phonetically similar reading errors of dyslexic children affect the accuracy of ASR. This study therefore evaluates whether acceptable ASR accuracy can be achieved using automatic transcription and phonetic labelling of dyslexic children's reading in BM. Three objectives were set: first, to produce manual transcription and phonetic labelling; second, to construct automatic transcription and phonetic labelling using forced alignment; and third, to compare the accuracy obtained with automatic versus manual transcription and phonetic labelling. To accomplish these goals, the methods used include manual speech labelling and segmentation, forced alignment, and Hidden Markov Models (HMM) and Artificial Neural Networks (ANN) for training; ASR accuracy was measured with Word Error Rate (WER) and False Alarm Rate (FAR). A total of 585 speech files were used for the manual transcription, forced alignment, and training experiments. The ASR engine using automatic transcription and phonetic labelling obtained an optimum accuracy of 76.04%, with a WER as low as 23.96% and a FAR of 17.9%. These results are very close to those of the ASR engine using manual transcription, namely 76.26% accuracy, 23.97% WER, and 17.9% FAR. In conclusion, the accuracy of automatic transcription and phonetic labelling is acceptable for helping dyslexic children learn using ASR in Bahasa Melayu (BM).
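WER figures like those quoted above are computed with a standard word-level edit distance; a minimal sketch (not the study's evaluation code, and the Malay sentences are hypothetical examples):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words:
print(wer("saya suka membaca buku", "saya suka baca buku"))  # -> 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported alongside a separate accuracy figure.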
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation
training (CAPT) have seen impressive progress in recent years. With the rapid
growth in language processing and deep learning over the past few years, there
is a need for an updated review. In this paper, we review methods employed in
pronunciation assessment at both the phonemic and prosodic levels. We
categorize the main challenges observed in prominent research trends, and
highlight existing limitations and available resources. This is followed by a
discussion of the remaining challenges and possible directions for future work.
(Comment: 9 pages, accepted to EMNLP Findings)
Statistical modelling of speech units in HMM-based speech synthesis for Arabic
This paper investigates statistical parametric speech synthesis of Modern Standard Arabic (MSA). An HMM-based speech synthesis system relies on a description of speech segments corresponding to phonemes, with a large set of features that represent phonetic, phonological, linguistic, and contextual aspects. When applied to MSA, two specific phenomena have to be taken into account: vowel lengthening and consonant gemination. This paper studies the modelling of these phenomena thoroughly through various approaches, for example the use of different units for short vs. long vowels and of different units for simple vs. geminated consonants. These approaches are compared to one that merges the short and long variants of a vowel into a single unit, and the simple and geminated variants of a consonant into a single unit (these characteristics being handled through the features associated with the sound). Results of a subjective evaluation show no significant difference between using the same unit for simple and geminated consonants (as well as for short and long vowels) and using different units for simple vs. geminated consonants (as well as for short vs. long vowels).
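The modelling alternatives compared above amount to a small preprocessing choice: either emit distinct units for geminated consonants and long vowels, or merge them with their simple/short counterparts and keep the distinction as a contextual feature. A hypothetical sketch, using ':' for length and '~' for gemination as ad-hoc annotations:

```python
def to_units(phones, separate_units=True):
    """Convert a phone sequence annotated with ':' (long vowel) or '~'
    (geminated consonant) into (unit, contextual-features) pairs.

    separate_units=True  -> 'b~' and 'a:' become fully-fledged units.
    separate_units=False -> they merge with 'b'/'a', and the length or
                            gemination distinction survives only as a
                            contextual feature flag.
    """
    units = []
    for p in phones:
        marked = p.endswith(":") or p.endswith("~")
        base = p[:-1] if marked else p
        unit = p if separate_units else base
        units.append((unit, {"long_or_geminated": marked}))
    return units

# The sequence /b a: b~ a/ under the two modelling approaches:
print(to_units(["b", "a:", "b~", "a"], separate_units=True))
print(to_units(["b", "a:", "b~", "a"], separate_units=False))
```

With merged units, the acoustic model sees a smaller unit inventory (more training data per unit) and must recover duration differences from the feature flag, which is exactly the trade-off the listening tests evaluate.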
Duration modeling using DNN for Arabic speech synthesis
Duration modelling is a key task for every parametric speech synthesis system. Though such parametric systems have been adapted to many languages, no special attention has been paid to explicitly handling the characteristics of Arabic speech. In Arabic, phoneme duration plays a distinctive role because of consonant gemination and vowel quantity; therefore, precise modelling of sound durations is critical. In this paper we compare several models of phoneme duration (including the duration modelling of the HTS and MERLIN toolkits), and we propose a new approach that relies on a set of models, each one optimal for a given phoneme class (e.g., simple consonants, geminated consonants, short vowels, and long vowels). An objective evaluation carried out on a set of test sentences shows that the proposed approach leads to more accurate modelling of phoneme durations.
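The per-class idea above can be sketched as routing each phoneme to a model chosen for its class. In this illustration the per-class "models" are just class-mean duration predictors standing in for the per-class DNNs, and the phone labels and durations are made up:

```python
from statistics import mean

# Hypothetical class assignment for a toy phone set.
PHONE_CLASSES = {
    "b": "simple_consonant", "b~": "geminated_consonant",
    "a": "short_vowel", "a:": "long_vowel",
}

class PerClassDurationModel:
    """One duration model per phoneme class. Here each 'model' is simply the
    class-mean duration, standing in for a per-class DNN."""

    def fit(self, phones, durations_ms):
        by_class = {}
        for p, d in zip(phones, durations_ms):
            by_class.setdefault(PHONE_CLASSES[p], []).append(d)
        self.mean_ = {c: mean(ds) for c, ds in by_class.items()}
        return self

    def predict(self, phone):
        return self.mean_[PHONE_CLASSES[phone]]

model = PerClassDurationModel().fit(
    ["b", "a", "b~", "a:", "a"],       # phone sequence
    [60.0, 50.0, 140.0, 110.0, 54.0])  # observed durations in ms
print(model.predict("a"))   # mean short-vowel duration -> 52.0
print(model.predict("b~"))  # geminated consonant -> 140.0
```

Separating the classes keeps the systematically longer geminates and long vowels from being averaged together with their short counterparts, which is the failure mode a single global duration model risks.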