Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis
A large part of the expressive speech synthesis literature focuses on
learning prosodic representations of the speech signal which are then modeled
by a prior distribution during inference. In this paper, we compare different
prior architectures on the task of predicting phoneme-level prosodic
representations extracted with an unsupervised FVAE model. We use both
subjective and objective metrics to show that normalizing-flow-based prior
networks can result in more expressive speech at the cost of a slight drop in
quality. Furthermore, we show that the synthesized speech has higher
variability, for a given text, due to the nature of normalizing flows. We also
propose a Dynamical VAE model that can generate higher-quality speech, although
with decreased expressiveness and variability compared to the flow-based
models.
Comment: Submitted to ICASSP 202
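The source of the output variability described above — drawing prosody latents from a flow-based prior rather than predicting a single mean — can be sketched with one affine flow layer. The flow parameters below are random stand-ins for the text-conditioned prior networks in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_phonemes, latent_dim = 8, 4

# Text-conditioned affine flow parameters (random stand-ins for a
# trained prior network conditioned on the phoneme sequence).
log_scale = rng.normal(0.0, 0.3, size=(n_phonemes, latent_dim))
shift = rng.normal(0.0, 1.0, size=(n_phonemes, latent_dim))

def sample_prosody_latents(rng):
    """Draw base noise and push it through the invertible affine flow:
    z ~ N(0, I),  y = exp(log_scale) * z + shift."""
    z = rng.normal(size=(n_phonemes, latent_dim))
    return np.exp(log_scale) * z + shift

# Two draws for the same text differ, which is where the synthesized
# speech gets its variability for a fixed input.
a = sample_prosody_latents(rng)
b = sample_prosody_latents(rng)
```

Each call resamples the base noise, so repeated synthesis of the same sentence yields different phoneme-level prosody.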
Towards a vividness in synthesized speech for audiobooks
The goal of this study was to determine which acoustic parameters are significant in differentiating the speaking style of a narrator from those of male and female characters as voiced by a reader of audiobooks. The study was initiated by a need to improve the expressivity and differentiation of speaking styles in fiction books read out by synthesized voices. The corpus used as research material was created from an audio novel, as read by a professional male voice artist. To determine whether it is possible to identify these speaking styles from the voice of the reader, a web-based perception test consisting of 48 sentences was conducted. The results showed that the listeners identified all three styles. For acoustic analysis, the openSMILE toolkit was used and 88 eGeMAPS-defined parameters were extracted for every sentence in the corpus. All styles were differentiated by 38 statistically significant parameters. To improve vividness, synthesizers aimed at reading fiction books could be trained to perform all three styles.
Summary. Hille Pajupuu, Rene Altrov and Jaan Pajupuu: Towards a vividness in synthesized speech for audiobooks. The aim of the study was to determine which key acoustic parameters distinguish, in the voice of an audiobook reader, the narrator's speech from the direct speech of male and female characters. The study was motivated by the need to improve the expressiveness and the distinguishability of speaking styles in fiction books read by synthetic voices. The research material was a corpus created from the audio novel „Tõde ja õigus I" as read by a professional male voice artist. To determine whether a listener can distinguish different speaking styles (narrator's speech, male and female characters' direct speech) from the audiobook reader's voice, a perception test of 48 sentences was compiled. The test results showed that the listeners recognized all three speaking styles. The entire corpus material was used for the acoustic analysis. Using the openSMILE toolkit, 88 eGeMAPS-defined parameters were extracted for each sentence. 38 parameters distinguished the speaking styles with statistical significance, of which 18 were related to voice quality and timbre, 11 to loudness, 8 to pitch, and 1 to tempo. Since both the perception test and the analysis of the acoustic parameters showed that the narrator's speech, the female characters' direct speech and the male characters' direct speech were all distinguishable in the audiobook, it can be considered worthwhile to train synthesizers that read fiction aloud to perform all three speaking styles.
Keywords: audiobooks, speaking style, direct speech, character speech, GeMAPS, speech analysis, expressive speech synthesis
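The per-parameter significance testing described above — checking which eGeMAPS functionals separate the three speaking styles — can be sketched as a one-way ANOVA over sentence-level feature values. The data below are synthetic; a real analysis would run this over the 88 extracted parameters:

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F-statistic for one acoustic parameter:
    between-group mean square over within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Synthetic per-sentence values of one parameter for narrator,
# male-character and female-character speech (hypothetical means).
rng = np.random.default_rng(1)
styles = [rng.normal(mu, 1.0, 60) for mu in (0.0, 0.8, 1.6)]
f_stat = anova_f(styles)
```

A parameter would count among the "statistically significant" ones when its F-statistic exceeds the critical value for the chosen significance level.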
The effect of GST announcement on stock market volatility: evidence from intraday data
Purpose – The purpose of this paper is to examine the effect of GST announcements (pre and post) on
Malaysian stock market index. This study also utilised intraday data to look into intraday market volatility
post-GST announcement.
Design/methodology/approach – Both daily closing prices and intraday data of different frequencies are
used to capture the extent of stock market volatility as well as the subsided period of the volatility. The period
of study ranges from June 2009 to November 2016 and empirical estimation is based on the GARCH (1, 1)
model for the pre- and post-GST announcements.
Findings – Persistent market volatility in the post-GST announcement period is
empirically recorded, and volatility is higher after the GST announcement than
before it. This demonstrates the market's unwillingness and adverse reaction
towards the tax policy implementation. The market's expectation that GST
implementation would raise the cost of living, following the increase in the
prices of goods and services in Malaysia, is empirically supported in the
post-GST announcement period.
Practical implications – The findings of this study are consistent with the
market's expectation that GST implementation would increase the prices of goods
and services and hence the cost of living. This is supported by a noticeable
increase in stock market volatility after the GST announcement. Although the
GST announcement could be classified as a scheduled announcement, unwillingness
to accept the policy prevails, as shown by the increase in stock market
volatility.
Originality/value – Past studies on stock market volatility have focused mainly
on the effects of the Asian and global financial crises, whereas this study
examines and highlights the effect of the GST announcement on stock market
volatility and uses intraday data to further examine the nature of the
volatility.
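The GARCH(1,1) conditional variance recursion used for the pre- and post-announcement estimations can be sketched as follows. This is a minimal illustration with simulated returns and hypothetical parameter values, not the paper's fitted estimates:

```python
import numpy as np

def garch11_variance(returns, omega, alpha, beta):
    """GARCH(1,1) conditional variance recursion:
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]."""
    returns = np.asarray(returns, dtype=float)
    sigma2 = np.empty_like(returns)
    sigma2[0] = np.var(returns)  # initialize at the sample variance
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

# Illustrative daily returns; alpha + beta close to 1 corresponds to the
# persistent volatility the paper reports post-announcement.
rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, size=500)
sigma2 = garch11_variance(r, omega=1e-6, alpha=0.08, beta=0.90)
```

Comparing the fitted alpha + beta across the pre- and post-announcement samples is what reveals the change in volatility persistence.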
Methodology for the generation of artificial voices of children for inclusive education
Integrating artificial voices into technological devices is one way to support communication for people with disabilities, as it provides greater tools for the inclusion of this population, upholding their rights as established in Law 8661. The Universidad de Costa Rica runs a project on speech technologies to improve the quality of life of people with disabilities, registered under Acción Social. This project aims to contribute to inclusion processes by designing augmentative communication systems that use artificial voices for children, generating artificial voices that match the children's own identity in gender and age. Recent advances in speech technologies, supported by systems that incorporate artificial intelligence, make it feasible to generate more flexible-sounding voices, opening up the possibility of creating personalized voices for specific accents and conditions. One of the greatest challenges is obtaining quality data from children's natural voices in order to emulate them with the help of a computer. In this paper we present the data design and the strategies for interacting with children to record their voices in a way that can be used to create new artificial voices, applicable in augmentative communication systems that promote their users' active participation in their own learning processes.
Diasemiotic translation of neuro-diagnostic tools into Greek
Translated neuro-diagnostic tools, which assess cognitive skills, involve several groups of users: authors, readers, translators, evaluators of translation quality, neuroscientists, and examinees.
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
In human speech, the attitude of a speaker cannot be fully expressed only by
the textual content. It has to come along with the intonation. Declarative
questions are commonly used in daily Cantonese conversations, and they are
usually uttered with rising intonation. Vanilla neural text-to-speech (TTS)
systems are not capable of synthesizing rising intonation for these sentences
due to the loss of semantic information. Though it has become more common to
complement the systems with extra language models, their performance in
modeling rising intonation is not well studied. In this paper, we propose to
complement the Cantonese TTS model with a BERT-based statement/question
classifier. We design different training strategies and compare their
performance. We conduct our experiments on a Cantonese corpus named CanTTS.
Empirical results show that the separate training approach obtains the best
generalization performance and feasibility.
Comment: Accepted by INTERSPEECH 202
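The complement described above — feeding a statement/question decision into the TTS front-end so declarative questions receive rising intonation — can be sketched as follows. The rule-based classifier stub and the tag names are hypothetical stand-ins for the BERT-based classifier in the paper:

```python
from typing import List

def classify_sentence_type(text: str) -> str:
    """Hypothetical stand-in for the BERT-based statement/question
    classifier; a trained model would use semantic context, not just
    punctuation cues."""
    return "question" if text.endswith(("?", "？")) else "statement"

def build_tts_input(text: str, phonemes: List[str]) -> List[str]:
    """Prepend an intonation tag so the TTS model can learn to raise
    the sentence-final pitch contour for questions."""
    tag = "<rise>" if classify_sentence_type(text) == "question" else "<flat>"
    return [tag] + phonemes

seq = build_tts_input("你食咗飯?", ["nei5", "sik6", "zo2", "faan6"])
```

In the separate training approach the classifier is trained independently of the acoustic model, with only its output tag shared at the interface, which is the setup the paper finds to generalize best.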
LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement
Recently, researchers have shown an increasing interest in automatically
predicting the subjective evaluation for speech synthesis systems. This
prediction is a challenging task, especially on the out-of-domain test set. In
this paper, we propose a novel fusion model for MOS prediction that combines
supervised and unsupervised approaches. In the supervised aspect, we developed
an SSL-based predictor called LE-SSL-MOS. The LE-SSL-MOS utilizes pre-trained
self-supervised learning models and further improves prediction accuracy by
utilizing the opinion scores of each utterance in the listener enhancement
branch. The unsupervised aspect comprises two steps: first, we fine-tune the
unit language model (ULM) using highly intelligible in-domain data to improve
the correlation of an unsupervised metric, SpeechLMScore; second, we use ASR
confidence as a new metric with the help of ensemble learning. To
our knowledge, this is the first architecture that fuses supervised and
unsupervised methods for MOS prediction. With these approaches, our
experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS
performs better than the baseline. Our fusion system achieved an absolute
improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our
system ranked 1st and 2nd, respectively, in the French speech synthesis track
and the challenge's noisy and enhanced speech track.
Comment: Accepted in IEEE-ASRU202
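A score-level fusion of a supervised predictor with unsupervised metrics, as described above, can be sketched as a weighted combination of z-normalized scores. The weights and example values are illustrative, not the challenge submission:

```python
import numpy as np

def znorm(x):
    """Z-normalize one metric so differently scaled scores are comparable."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-8)

def fuse_scores(supervised_mos, speechlm_score, asr_confidence,
                weights=(0.6, 0.2, 0.2)):
    """Weighted fusion of one supervised and two unsupervised
    quality metrics into a single predicted-quality ranking."""
    parts = [znorm(supervised_mos), znorm(speechlm_score), znorm(asr_confidence)]
    return sum(w * p for w, p in zip(weights, parts))

# Three hypothetical systems scored by each metric.
fused = fuse_scores([3.2, 4.1, 2.5], [0.1, 0.4, 0.0], [0.7, 0.9, 0.5])
```

Since MOS prediction is typically evaluated with rank correlations, the fused score only needs to order systems correctly, not reproduce the MOS scale.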
Speech synthesis : Developing a web application implementing speech technology
Speech is a natural media of communication for humans. Text-to-speech (TTS) technology uses a computer to synthesize speech. There are three main techniques of TTS synthesis. These are formant-based, articulatory and concatenative. The application areas of TTS include accessibility, education, entertainment and communication aid in mass transit.
A web application was developed to demonstrate the application of speech synthesis technology. Existing speech synthesis engines for the Finnish language were compared, and two open-source text-to-speech engines, Festival and eSpeak, were selected for use with the web application. The application uses a Linux-based speech server which communicates with client devices via HTTP GET requests.
The application development successfully demonstrated the use of speech synthesis in language learning. One of the emerging sectors for speech technologies is the mobile market, due to the limited input capabilities of mobile devices. Speech technologies are not equally available in all languages. Text in the Oromo language was tested using Finnish speech synthesizers; owing to similar orthographic rules for the gemination of consonants and the length of vowels, intelligible results were obtained.
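A client request to such a speech server can be sketched as a plain HTTP GET carrying the text and engine choice as query parameters. The endpoint and parameter names below are hypothetical, not the thesis's actual API:

```python
from urllib.parse import urlencode

def build_tts_request(base_url, text, engine="festival", lang="fi"):
    """Build the GET URL a client device would use to ask the speech
    server to synthesize `text` with the chosen engine (hypothetical
    endpoint and parameter names)."""
    query = urlencode({"text": text, "engine": engine, "lang": lang})
    return f"{base_url}/synthesize?{query}"

url = build_tts_request("http://speech.example.org", "Hyvää päivää")
```

The server would synthesize the audio with the requested engine and return it in the HTTP response body, keeping the client thin.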
LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech
Recent advances in neural text-to-speech (TTS) models bring thousands of TTS
applications into daily life, where models are deployed in the cloud to provide
services for customers. Among these models are diffusion probabilistic models
(DPMs), which can be stably trained and are more parameter-efficient compared
with other generative models. As transmitting data between customers and the
cloud introduces high latency and the risk of exposing private data, deploying
TTS models on edge devices is preferred. When implementing DPMs onto edge
devices, there are two practical problems. First, current DPMs are not
lightweight enough for resource-constrained devices. Second, DPMs require many
denoising steps in inference, which increases latency. In this work, we present
LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight
U-Net diffusion decoder and a training-free fast sampling technique, reducing
both model parameters and inference latency. Streaming inference is also
implemented in LightGrad to reduce latency further. Compared with Grad-TTS,
LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in
latency, while preserving comparable speech quality on both Chinese Mandarin
and English in 4 denoising steps.
Comment: Accepted by ICASSP 202
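The few-step inference that makes such a DPM practical on edge devices can be sketched as a short reverse-diffusion loop. This toy version denoises a 1-D signal in a fixed 4-step schedule with a hypothetical closed-form score function; it only illustrates the sampling pattern, not LightGrad's actual fast sampler:

```python
import numpy as np

def toy_score(x, sigma):
    """Hypothetical score estimate pulling samples toward zero mean;
    a trained denoising network would replace this closed form."""
    return -x / (sigma ** 2 + 1.0)

def fast_sample(dim=16, n_steps=4, seed=0):
    """Few-step reverse diffusion: start from Gaussian noise and take a
    small, fixed number of deterministic (DDIM-style) Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    sigmas = np.linspace(1.0, 0.05, n_steps + 1)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # deterministic update: no fresh noise injected between steps
        x = x + (sigma - sigma_next) * sigma * toy_score(x, sigma)
    return x

sample = fast_sample()
```

Cutting the schedule from hundreds of steps to a handful is what removes most of the inference latency, at the cost of a harder approximation per step.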
Harmonic Gradience in Greek Rap Rhymes
This study investigates the gradience in mismatch acceptability in Greek rap Imperfect Rhyme. We consider that the rhyme domains of a rhyming pair are in a Base-Reduplicant correspondence relationship, requiring segmental identity in Place of Articulation, Manner of Articulation and voicing, and manifesting a gradient acceptability of featural mismatches. The analysis shows that voicing mismatches are highly marked in Greek rap rhyme, implying a high perceptual salience, which seems to be language-specific. Mismatches only in Place of Articulation appear to be the most harmonic and frequent mismatch pattern, while Harmony and similarity decrease in inverse proportion to the number of mismatching features. Mismatches in Greek rap rhyme are, in principle, accepted between unmarked corresponding consonants. Obstruents, mainly Stops, mismatch in Place of Articulation, and Coronals mismatch in Manner of Articulation. In general, mainly Coronal Obstruents and, to a lesser extent, Nasals are involved in mismatches. Due to the attested high avoidance of voicing mismatches, which are acoustically non-salient, we propose that, in Standard Modern Greek, perceptual salience is not purely phonetic, as it also seems to be based on the language-specific phonological grammar.
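The mismatch counting underlying this gradience analysis can be sketched with a toy feature table; the feature values below are simplified placeholders, not the study's full phonological specification:

```python
# Toy featural specification: (Place, Manner, voicing) per consonant.
FEATURES = {
    "t": ("coronal", "stop", "voiceless"),
    "d": ("coronal", "stop", "voiced"),
    "p": ("labial", "stop", "voiceless"),
    "s": ("coronal", "fricative", "voiceless"),
}

def mismatches(c1, c2):
    """Return which of Place, Manner and voicing differ between two
    corresponding rhyme-domain consonants."""
    names = ("place", "manner", "voicing")
    return [n for n, a, b in zip(names, FEATURES[c1], FEATURES[c2]) if a != b]

# Place-only mismatches (e.g. t~p) are the most harmonic pattern in the
# data, while voicing mismatches (e.g. t~d) are highly marked.
place_only = mismatches("t", "p")
voicing_only = mismatches("t", "d")
```

Harmony then decreases with the length of the mismatch list, matching the observation that acceptability falls as more features diverge.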