Text to speech for Bangla language using festival
Includes bibliographical references (pages 6-7). In this paper, we present a Text to Speech (TTS) synthesis system for the Bangla language using the open-source Festival TTS engine. Festival is a complete TTS synthesis system, with components supporting front-end processing of the input text, language modeling, and speech synthesis using its signal processing module. The Bangla TTS system proposed here creates the voice data for Festival and additionally extends Festival through its embedded Scheme scripting interface to incorporate Bangla language support. Festival is a concatenative TTS system using diphone or other unit-selection speech units. Our TTS implementation uses two of the concatenative methods supported in Festival: unit selection and multisyn unit selection. The function of a Text-to-Speech system is to convert language text into its spoken equivalent through a series of modules. These modules, which constitute the TTS system, are described in detail to aid future development. Finally, the quality of the synthesized speech is assessed in terms of acceptability and intelligibility.
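The concatenative approach the abstract describes can be sketched in a few lines: split the target phone string into diphones and join the matching recorded units. This is a minimal illustration of the idea only; the unit names and toy "waveforms" below are hypothetical placeholders, not Festival's actual data format or Scheme API.

```python
# Minimal sketch of concatenative (diphone) synthesis as used by
# Festival-style systems. Unit names and waveforms are invented toys.

def to_diphones(phones):
    """Split a phone sequence into overlapping diphone names."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

def synthesize(phones, unit_db):
    """Concatenate stored unit waveforms (lists of samples) in order."""
    wave = []
    for unit in to_diphones(phones):
        wave.extend(unit_db[unit])  # naive join; real systems smooth unit boundaries
    return wave

unit_db = {"k-a": [0.1, 0.2], "a-t": [0.3, 0.4]}  # toy "recordings"
print(to_diphones(["k", "a", "t"]))               # ['k-a', 'a-t']
print(synthesize(["k", "a", "t"], unit_db))       # [0.1, 0.2, 0.3, 0.4]
```

A real unit-selection system would additionally search a large database for the best-matching units and smooth the joins, which is where most of Festival's machinery lives.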
Marathi Speech Synthesis: A Review
This paper seeks to reveal the various aspects of Marathi speech synthesis. It reviews research developments in international languages as well as Indian languages, and then focuses on developments in Marathi relative to other Indian languages. It is anticipated that this work will encourage further exploration of Marathi speech synthesis.
DOI: 10.17762/ijritcc2321-8169.15064
Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity
GAN vocoders are currently one of the state-of-the-art methods for building
high-quality neural waveform generative models. However, most of their
architectures require dozens of billion floating-point operations per second
(GFLOPS) to generate speech waveforms in a samplewise manner. This makes GAN
vocoders still challenging to run on normal CPUs without accelerators or
parallel computers. In this work, we propose a new architecture for GAN
vocoders that mainly depends on recurrent and fully-connected networks to
directly generate the time-domain signal in a framewise manner. This results in
considerable reduction of the computational cost and enables very fast
generation on both GPUs and low-complexity CPUs. Experimental results show that
our Framewise WaveGAN vocoder achieves significantly higher quality than
auto-regressive maximum-likelihood vocoders such as LPCNet at a very low
complexity of 1.2 GFLOPS. This makes GAN vocoders more practical on edge and
low-power devices. Comment: Submitted to ICASSP 2023, demo:
https://ahmed-fau.github.io/fwgan_demo
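The complexity argument above is essentially a frame-rate-versus-sample-rate trade: a network run once per frame amortizes its cost over many samples. The back-of-envelope below illustrates this; the sample rate, frame size, and per-step FLOP counts are illustrative assumptions, with only the 1.2 GFLOPS total quoted from the abstract.

```python
# Back-of-envelope: samplewise vs framewise generation cost.
# Per-step FLOP counts are hypothetical; only 1.2 GFLOPS is from the abstract.

sample_rate = 16_000          # samples per second (assumed)
frame_size = 160              # samples per frame, i.e. 10 ms (assumed)

flops_per_sample_step = 1e6   # hypothetical cost of one samplewise network step
flops_per_frame_step = 12e6   # hypothetical cost of one framewise network step

samplewise_gflops = sample_rate * flops_per_sample_step / 1e9
framewise_gflops = (sample_rate / frame_size) * flops_per_frame_step / 1e9

print(f"samplewise: {samplewise_gflops:.1f} GFLOPS")  # 16.0
print(f"framewise:  {framewise_gflops:.1f} GFLOPS")   # 1.2
```

Even with a network an order of magnitude larger per step, running it 100x less often yields a large net saving, which is the framewise design's premise.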
An application of an auditory periphery model in speaker identification
The number of applications of automatic Speaker Identification (SID) is growing due to advanced technologies for secure access and authentication in services and devices. A 2016 study found that the Cascade of Asymmetric Resonators with Fast-Acting Compression (CAR-FAC) cochlear model achieved the best performance among seven recent cochlear models at fitting a set of human auditory physiological data. Motivated by this result, I apply the CAR-FAC cochlear model to an SID task for the first time, aiming to approach the performance of the human auditory system. This thesis investigates the potential of the CAR-FAC model in SID, examining its capability in both text-dependent and text-independent SID tasks. It also investigates the contributions of different parameters, nonlinearities, and stages of the CAR-FAC that enhance SID accuracy. The performance of the CAR-FAC is compared with another recent cochlear model, the Auditory Nerve (AN) model. In addition, three FFT-based auditory features – Mel-Frequency Cepstral Coefficients (MFCC), Frequency Domain Linear Prediction (FDLP), and Gammatone Frequency Cepstral Coefficients (GFCC) – are included to compare their performance with cochlear features. This comparison allows me to identify a better front-end for a noise-robust SID system. Three statistical classifiers – a Gaussian Mixture Model with Universal Background Model (GMM-UBM), a Support Vector Machine (SVM), and an i-vector system – were used to evaluate performance, and allow me to investigate nonlinearities in the cochlear front-ends. Performance is evaluated under clean and noisy conditions over a wide range of noise levels. Techniques to improve the performance of a cochlear algorithm are also investigated. It was found that applying a cube root and a DCT to the cochlear output enhances SID accuracy substantially.
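The post-processing the thesis reports as most helpful (cube-root compression of the cochlear output followed by a DCT, analogous to the final steps of MFCC extraction) can be sketched compactly. The channel energies and coefficient count below are illustrative, not values from the thesis.

```python
import math

# Sketch of cube-root compression + DCT on cochlear channel energies,
# the post-processing the abstract reports as enhancing SID accuracy.

def dct_ii(x):
    """Plain DCT-II of a sequence (no normalization, stdlib only)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * k * (2 * j + 1) / (2 * n))
                for j in range(n)) for k in range(n)]

def cochlear_features(energies, n_coeffs=4):
    compressed = [e ** (1.0 / 3.0) for e in energies]  # cube-root compression
    return dct_ii(compressed)[:n_coeffs]               # keep low-order coefficients

print(cochlear_features([8.0, 27.0, 64.0, 125.0, 1.0, 8.0]))
```

The cube root plays the role the log plays in MFCCs (dynamic-range compression), and the DCT decorrelates the channel outputs so a GMM with diagonal covariances models them well.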
Learning cross-lingual phonological and orthographic adaptations: a case study in improving neural machine translation between low-resource languages
Out-of-vocabulary (OOV) words can pose serious challenges for machine
translation (MT) tasks, and in particular, for low-resource language (LRL)
pairs, i.e., language pairs for which few or no parallel corpora exist. Our
work adapts variants of seq2seq models to perform transduction of such words
from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs
built from a bilingual dictionary of Hindi--Bhojpuri words. We demonstrate that
our models can be effectively used for language pairs that have limited
parallel corpora; our models work at the character level to grasp phonetic and
orthographic similarities across multiple types of word adaptations, whether
synchronic or diachronic, loan words or cognates. We describe the training
aspects of several character level NMT systems that we adapted to this task and
characterize their typical errors. Our method improves BLEU score by 6.3 on the
Hindi-to-Bhojpuri translation task. Further, we show that such transductions
can generalize well to other languages by applying them successfully to Hindi --
Bangla cognate pairs. Our work can be seen as an important step in the process
of: (i) resolving the OOV words problem arising in MT tasks, (ii) creating
effective parallel corpora for resource-constrained languages, and (iii)
leveraging the enhanced semantic knowledge captured by word-level embeddings to
perform character-level tasks. Comment: 47 pages, 4 figures, 21 tables (including Appendices)
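Character-level seq2seq transduction of the kind described above starts from cognate word pairs rendered as character sequences. The sketch below shows that data-preparation step only; the words and the dictionary are invented examples (real Hindi-Bhojpuri cognates would be Devanagari strings), and the model itself is omitted.

```python
# Sketch of turning a bilingual cognate dictionary into character-level
# training pairs for a seq2seq transduction model. Words are invented.

def to_char_pairs(cognate_dict):
    """Render word pairs as space-separated character sequences."""
    return [(" ".join(src), " ".join(tgt)) for src, tgt in cognate_dict.items()]

cognates = {"ghar": "ghare", "pani": "paani"}  # hypothetical cognate pairs
for src, tgt in to_char_pairs(cognates):
    print(src, "->", tgt)
```

Operating on characters rather than words is what lets the model capture the phonetic and orthographic regularities shared by cognates, and is why it generalizes to unseen (OOV) words.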
NoLACE: Improving Low-Complexity Speech Codec Enhancement Through Adaptive Temporal Shaping
Speech codec enhancement methods are designed to remove distortions added by
speech codecs. While classical methods are very low in complexity and add zero
delay, their effectiveness is rather limited. Compared to that, DNN-based
methods deliver higher quality but they are typically high in complexity and/or
require delay. The recently proposed Linear Adaptive Coding Enhancer (LACE)
addresses this problem by combining DNNs with classical long-term/short-term
postfiltering, resulting in a causal low-complexity model. A shortcoming of the
LACE model, however, is that quality quickly saturates when the model size is
scaled up. To mitigate this problem, we propose a novel adaptive temporal
shaping module that adds high temporal resolution to the LACE model resulting
in the Non-Linear Adaptive Coding Enhancer (NoLACE). We adapt NoLACE to enhance
the Opus codec and show that NoLACE significantly outperforms both the Opus
baseline and an enlarged LACE model at 6, 9 and 12 kb/s. We also show that LACE
and NoLACE are well-behaved when used with an ASR system. Comment: submitted to ICASSP 202
LACE: A light-weight, causal model for enhancing coded speech through adaptive convolutions
Classical speech coding uses low-complexity postfilters with zero lookahead
to enhance the quality of coded speech, but their effectiveness is limited by
their simplicity. Deep Neural Networks (DNNs) can be much more effective, but
require high complexity and model size, or added delay. We propose a DNN model
that generates classical filter kernels on a per-frame basis with a model of
just 300 K parameters and 100 MFLOPS complexity, which is a practical
complexity for desktop or mobile device CPUs. The lack of added delay allows it
to be integrated into the Opus codec, and we demonstrate that it enables
effective wideband encoding for bitrates down to 6 kb/s. Comment: 5 pages, accepted at WASPAA 202
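The core mechanism described above, a DNN that emits a fresh short filter kernel for every frame, reduces at application time to per-frame FIR filtering. The sketch below shows only that filtering step, with the kernel-predicting network omitted; the frame size and toy kernels are illustrative assumptions, not LACE's actual configuration.

```python
# Minimal sketch of per-frame adaptive filtering: a (omitted) small DNN
# would predict one short kernel per frame, and each frame of the coded
# signal is convolved with its own kernel. Toy values throughout.

def filter_frame(frame, kernel):
    """Causal FIR filtering of one frame (zero initial state)."""
    out = []
    for n in range(len(frame)):
        acc = 0.0
        for k, h in enumerate(kernel):
            if n - k >= 0:
                acc += h * frame[n - k]
        out.append(acc)
    return out

def enhance(signal, kernels, frame_size):
    """Apply a distinct predicted kernel to each successive frame."""
    out = []
    for i, kernel in enumerate(kernels):
        frame = signal[i * frame_size:(i + 1) * frame_size]
        out.extend(filter_frame(frame, kernel))
    return out

signal = [1.0, 0.0, 0.0, 1.0]       # two 2-sample frames
kernels = [[1.0], [0.5, 0.5]]       # identity kernel, then a smoothing kernel
print(enhance(signal, kernels, 2))  # [1.0, 0.0, 0.0, 0.5]
```

Because the filtering itself is a classical operation and only the small kernel-predicting network runs per frame, the model stays causal and cheap enough for CPU deployment, which is the design point the abstract emphasizes.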