
    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

    A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal, which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures on the task of predicting phoneme-level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics to show that normalizing-flow-based prior networks can result in more expressive speech at the cost of a slight drop in quality. Furthermore, we show that the synthesized speech has higher variability, for a given text, due to the nature of normalizing flows. We also propose a Dynamical VAE model that can generate higher-quality speech, although with decreased expressiveness and variability compared to the flow-based models. Comment: Submitted to ICASSP 202
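
    As a rough illustration of the contrast between the two prior types, below is a minimal numpy sketch, not the authors' implementation: the AR prior samples each phoneme's latent conditioned on the previous one, while the flow prior draws all base noise at once and pushes it through an inverse flow conditioned on the text. The toy ar_step and inverse_flow functions are stand-ins for trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 4  # toy dimensionality of the phoneme-level prosody latent

def ar_step(phone_emb, prev_latent):
    # Toy stand-in for a trained AR prior step: predicts the mean and scale
    # of the current latent from the phoneme embedding and the previous latent.
    mu = 0.5 * phone_emb + 0.3 * prev_latent
    sigma = np.full(LATENT_DIM, 0.1)
    return mu, sigma

def sample_ar_prior(phone_embs):
    # Latents are drawn one phoneme at a time, each conditioned on the last.
    latents, prev = [], np.zeros(LATENT_DIM)
    for emb in phone_embs:
        mu, sigma = ar_step(emb, prev)
        prev = mu + sigma * rng.standard_normal(LATENT_DIM)
        latents.append(prev)
    return np.stack(latents)

def inverse_flow(eps, phone_embs):
    # Toy stand-in for the inverse of a trained normalizing flow:
    # an affine map whose parameters depend on the text.
    return 0.2 * eps + phone_embs

def sample_flow_prior(phone_embs):
    # All base noise is drawn in parallel and mapped through the inverse flow;
    # resampling eps is what yields the higher output variability noted above.
    eps = rng.standard_normal((len(phone_embs), LATENT_DIM))
    return inverse_flow(eps, phone_embs)

phone_embs = rng.standard_normal((6, LATENT_DIM))  # six dummy phoneme embeddings
print(sample_ar_prior(phone_embs).shape, sample_flow_prior(phone_embs).shape)
```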

    Towards a vividness in synthesized speech for audiobooks

    The goal of this study was to determine which acoustic parameters are significant in differentiating the speaking styles of a narrator and those of male and female characters as voiced by an audiobook reader. The study was initiated by a need to improve the expressiveness and differentiation of speaking styles in fiction books read out by synthesized voices. The corpus used as research material was created from an audio novel read by a professional male voice artist. To determine whether it is possible to identify these speaking styles from the voice of the reader, a web-based perception test consisting of 48 sentences was conducted. The results showed that the listeners identified all three styles. For acoustic analysis, the openSMILE toolkit was used and 88 eGeMAPS-defined parameters were extracted for every sentence in the corpus. All styles were differentiated by 38 statistically significant parameters. To improve vividness, synthesizers aimed at reading fiction books could be trained to perform all three styles. Summary. Hille Pajupuu, Rene Altrov and Jaan Pajupuu: Towards enlivening synthesized speech in audiobooks. The aim of the study was to find out which key acoustic parameters distinguish, in the voice of an audiobook reader, the narrator's speech from the direct speech of male and female characters. The study was prompted by the need to improve the expressiveness and the distinguishability of speaking styles in fiction books read with a synthesized voice. The research material was a corpus built from the audio novel "Tõde ja õigus I", read by a professional male voice. To find out whether a listener can distinguish the different speaking styles (the narrator's speech, the male and female characters' direct speech) from the audiobook reader's voice, a perception test of 48 sentences was compiled. The test results showed that listeners recognized all three speaking styles. The entire corpus was used for acoustic analysis: with the openSMILE toolkit, 88 parameters defined in eGeMAPS were extracted for each sentence. The speaking styles were differentiated at a statistically significant level by 38 parameters, of which 18 were related to voice quality and timbre, 11 to loudness, 8 to pitch, and 1 to tempo. Since both the perception test and the analysis of acoustic parameters showed that the narrator's speech, the female characters' direct speech, and the male characters' direct speech were all distinguishable in the audiobook, it appears worthwhile to train synthesizers that read fiction aloud to perform all three speaking styles. Keywords: audiobooks, speaking style, direct speech, character speech, GeMAPS, speech analysis, expressive speech synthesis
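
    As a hedged sketch of the feature-extraction step described above (not necessarily the authors' exact pipeline), the opensmile Python package can compute the 88 eGeMAPS functionals for one sentence per call; the file name sentence_001.wav is a placeholder.

```python
import opensmile  # pip install opensmile

# eGeMAPSv02 functionals: 88 utterance-level acoustic parameters per input file.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Placeholder file name; in the study, one feature row would be produced
# for every sentence in the audiobook corpus and then compared across styles.
features = smile.process_file("sentence_001.wav")
print(features.shape)  # (1, 88)
```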

    The effect of GST announcement on stock market volatility: evidence from intraday data

    Purpose – The purpose of this paper is to examine the effect of the GST announcement (pre and post) on the Malaysian stock market index. The study also utilised intraday data to examine market volatility after the GST announcement. Design/methodology/approach – Both daily closing prices and intraday data of different frequencies are used to capture the extent of stock market volatility as well as the period over which the volatility subsides. The period of study ranges from June 2009 to November 2016, and empirical estimation is based on the GARCH (1, 1) model for the pre- and post-GST announcement periods. Findings – Persistent market volatility after the GST announcement is empirically recorded, and volatility is higher in the post-announcement period than in the pre-announcement period. This reflects the market's reaction to, and unwillingness to accept, the tax policy implementation. The market's expectation that GST implementation would raise the cost of living, through higher prices of goods and services in Malaysia, is empirically supported in the post-announcement period. Practical implications – The finding of this study is consistent with the market's expectation that GST implementation will increase the prices of goods and services and hence the cost of living. This is supported by a noticeable increase in stock market volatility after the GST announcement. Although the GST announcement could be classified as a scheduled announcement, unwillingness to accept the policy prevails, as shown by the increase in stock market volatility. Originality/value – The effects of the Asian and global financial crises are the major focus of past studies on stock market volatility, whereas this study examines and highlights the effect of the GST announcement on stock market volatility and uses intraday data to further examine the nature of the volatility.
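
    For readers unfamiliar with the model, here is a minimal sketch of a GARCH(1,1) fit using the Python arch package on synthetic returns; the split date and the data are illustrative placeholders, not the paper's dataset.

```python
import numpy as np
import pandas as pd
from arch import arch_model  # pip install arch

# Synthetic daily returns standing in for the Malaysian index series.
rng = np.random.default_rng(1)
returns = pd.Series(rng.standard_normal(1500) * 0.8,
                    index=pd.bdate_range("2009-06-01", periods=1500))

# Illustrative announcement date; fit GARCH(1,1) separately on each sub-sample.
split = "2013-10-25"
for label, sample in [("pre", returns.loc[:split]), ("post", returns.loc[split:])]:
    res = arch_model(sample, vol="Garch", p=1, q=1).fit(disp="off")
    # alpha + beta close to 1 indicates highly persistent volatility.
    persistence = res.params["alpha[1]"] + res.params["beta[1]"]
    print(label, round(persistence, 3))
```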

    Methodology for the generation of artificial voices of children for inclusive education

    The integration of artificial voices into technological devices is one way to support communication for people with disabilities, as it provides additional tools for the inclusion of this population and upholds the rights established in Law 8661. The Universidad de Costa Rica runs a project on speech technologies to improve the quality of life of people with disabilities, registered under its Acción Social programme. This project aims to contribute to inclusion processes by building augmentative communication systems that use artificial voices for children, generating voices that match each child's own identity in gender and age. Recent advances in speech technologies, supported by systems that incorporate artificial intelligence, make it feasible to generate more flexible-sounding voices, opening the possibility of creating personalized voices for specific accents and conditions. One of the greatest challenges is obtaining quality data from the natural voices of children so that they can be emulated with the help of a computer. In this presentation we show the data design and the strategies for interacting with children in order to record their voices in a way that can be used to create new artificial voices, applicable in augmentative communication systems that give their users access to active participation in their own learning processes.

    Diasemiotic translation of neuro-diagnostic tools into Greek

    Translated neuro-diagnostic tools, which assess cognitive skills, involve authors, readers, translators, evaluators of translation quality, neuroscientists, and examinees as their users.

    A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

    In human speech, a speaker's attitude cannot be fully expressed by the textual content alone; it must also be conveyed through intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of this semantic information. Though it has become more common to complement such systems with extra language models, their performance in modeling rising intonation is not well studied. In this paper, we propose to complement a Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach obtains the best generalization performance and feasibility. Comment: Accepted by INTERSPEECH 202
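
    Below is a hedged sketch of the auxiliary component described above, written with the Hugging Face transformers API. The checkpoint name bert-base-chinese is a generic placeholder rather than the paper's fine-tuned Cantonese classifier, and the predicted label would be passed to the TTS model as an extra input to trigger rising intonation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; in practice a BERT model fine-tuned on Cantonese
# statement/question data would be loaded instead, so predictions here are
# only illustrative (the classification head below is untrained).
CKPT = "bert-base-chinese"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

def classify_sentence_type(text: str) -> int:
    """Return 0 for statement, 1 for (declarative) question."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

# The resulting label could be fed to the TTS front end alongside the text.
print(classify_sentence_type("你聽日返工"))
```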

    LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement

    Recently, researchers have shown increasing interest in automatically predicting the subjective evaluation of speech synthesis systems. This prediction is a challenging task, especially on out-of-domain test sets. In this paper, we propose a novel fusion model for MOS prediction that combines supervised and unsupervised approaches. On the supervised side, we developed an SSL-based predictor called LE-SSL-MOS. LE-SSL-MOS builds on pre-trained self-supervised learning models and further improves prediction accuracy by using the opinion scores of each utterance in a listener enhancement branch. On the unsupervised side, there are two steps: first, we fine-tuned the unit language model (ULM) on highly intelligible domain data to improve the correlation of an unsupervised metric, SpeechLMScore; second, we used ASR confidence as an additional metric with the help of ensemble learning. To our knowledge, this is the first architecture that fuses supervised and unsupervised methods for MOS prediction. With these approaches, our experimental results on the VoiceMOS Challenge 2023 show that LE-SSL-MOS performs better than the baseline. Our fusion system achieved an absolute improvement of 13% over LE-SSL-MOS on the noisy and enhanced speech track. Our system ranked 1st and 2nd, respectively, in the French speech synthesis track and the challenge's noisy and enhanced speech track. Comment: accepted at IEEE ASRU 202
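
    As a loose illustration of fusing supervised and unsupervised signals (the paper's actual ensemble may differ), the sketch below learns a simple linear weighting of three toy per-utterance scores standing in for the LE-SSL-MOS prediction, SpeechLMScore, and ASR confidence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy scores standing in for the three signals fused in the paper.
rng = np.random.default_rng(0)
n = 200
le_ssl_mos = rng.uniform(1, 5, n)       # supervised MOS predictor output
speechlm = rng.uniform(-3, 0, n)        # unsupervised SpeechLMScore
asr_conf = rng.uniform(0, 1, n)         # ASR confidence
true_mos = 0.7 * le_ssl_mos + 0.3 * (asr_conf * 4 + 1) + rng.normal(0, 0.2, n)

X = np.column_stack([le_ssl_mos, speechlm, asr_conf])

# A learned weighting of the three metrics approximates the listening-test MOS.
fusion = LinearRegression().fit(X, true_mos)
print(fusion.predict(X[:3]))
```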

    Speech synthesis : Developing a web application implementing speech technology

    Speech is a natural medium of communication for humans. Text-to-speech (TTS) technology uses a computer to synthesize speech. There are three main techniques of TTS synthesis: formant-based, articulatory, and concatenative. The application areas of TTS include accessibility, education, entertainment, and communication aids in mass transit. A web application was developed to demonstrate the application of speech synthesis technology. Existing speech synthesis engines for the Finnish language were compared, and two open-source text-to-speech engines, Festival and eSpeak, were selected for use with the web application. The application uses a Linux-based speech server which communicates with client devices via HTTP GET requests. The application development successfully demonstrated the use of speech synthesis in language learning. One of the emerging sectors for speech technologies is the mobile market, due to the limited input capabilities of mobile devices. Speech technologies are not equally available in all languages. Text in the Oromo language was tested using the Finnish speech synthesizers; due to similar orthographic rules for gemination of consonants and vowel length, intelligible results were obtained.
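
    As a hedged illustration of the client-server pattern described above (the endpoint, host, and query parameters are hypothetical, not the thesis's actual interface), a client could request synthesized audio from such a speech server with a plain HTTP GET:

```python
import requests  # pip install requests

# Hypothetical speech-server endpoint and parameters; the real application's
# URL, parameter names, and engine identifiers are not given in the abstract.
SERVER = "http://speech-server.example.org/synthesize"
params = {
    "text": "Hyvää päivää",   # sentence to synthesize
    "engine": "festival",     # e.g. "festival" or "espeak"
    "lang": "fi",
}

response = requests.get(SERVER, params=params, timeout=10)
response.raise_for_status()

# Assume the server returns audio bytes (e.g. WAV) in the response body.
with open("output.wav", "wb") as f:
    f.write(response.content)
```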

    LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech

    Recent advances in neural text-to-speech (TTS) models have brought thousands of TTS applications into daily life, where models are deployed in the cloud to provide services for customers. Among these models are diffusion probabilistic models (DPMs), which can be stably trained and are more parameter-efficient than other generative models. As transmitting data between customers and the cloud introduces high latency and the risk of exposing private data, deploying TTS models on edge devices is preferred. When deploying DPMs on edge devices, there are two practical problems. First, current DPMs are not lightweight enough for resource-constrained devices. Second, DPMs require many denoising steps during inference, which increases latency. In this work, we present LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight U-Net diffusion decoder and a training-free fast sampling technique, reducing both model parameters and inference latency. Streaming inference is also implemented in LightGrad to reduce latency further. Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency, while preserving comparable speech quality on both Mandarin Chinese and English with 4 denoising steps. Comment: Accepted by ICASSP 202
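
    To make the few-step inference idea concrete, here is a generic DDIM-style sampling loop in numpy with four denoising steps; it is an illustration only, not LightGrad's actual lightweight U-Net or its training-free fast sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Cosine-style noise schedule: alpha_bar decays from ~1 toward 0 over T steps.
alpha_bar = np.cos(np.linspace(0.0, np.pi / 2, T + 1)[:T]) ** 2

def toy_denoiser(x_t, t, cond):
    # Stand-in for the lightweight U-Net decoder, which would predict the noise
    # in x_t given the timestep and the phoneme/text conditioning.
    return x_t - cond

def fast_sample(cond, n_steps=4):
    # Generic deterministic (DDIM-style) sampling on an n_steps sub-schedule.
    steps = np.linspace(T - 1, 0, n_steps).astype(int)
    x = rng.standard_normal(cond.shape)  # start from pure Gaussian noise
    for i, t in enumerate(steps):
        eps = toy_denoiser(x, t, cond)
        x0 = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        ab_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps
    return x

cond = rng.standard_normal((80, 16))  # dummy mel-spectrogram-shaped conditioning
print(fast_sample(cond).shape)  # (80, 16)
```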

    Harmonic Gradience in Greek Rap Rhymes

    This study investigates the gradience in mismatch acceptability in Greek rap Imperfect Rhyme. We consider that the rhyme domains of a rhyming pair are in a Base-Reduplicant correspondence relationship, requiring segmental identity in Place of Articulation, Manner of Articulation, and voicing, and manifesting a gradient acceptability of featural mismatches. The analysis shows that voicing mismatches are highly marked in Greek rap rhyme, implying a high perceptual salience that seems to be language-specific. Mismatches only in Place of Articulation appear to be the most harmonic and most frequent mismatch pattern, while Harmony and similarity decrease as the number of mismatching features increases. Mismatches in Greek rap rhyme are, in principle, accepted between unmarked corresponding consonants. Obstruents, mainly Stops, mismatch in Place of Articulation, and Coronals mismatch in Manner of Articulation. In general, mainly Coronal Obstruents and, to a lesser extent, Nasals are involved in mismatches. Given the attested strong avoidance of voicing mismatches, which are acoustically non-salient, we propose that, in Standard Modern Greek, perceptual salience is not purely phonetic, as it also seems to be based on the language-specific phonological grammar.
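
    To make the notion of featural mismatch concrete, the following is a toy sketch (the feature table is a simplified illustration, not the study's data): it counts which of the three dimensions, Place, Manner, and voicing, differ between two corresponding consonants of a rhyme pair.

```python
# Toy feature table for a few consonants; values are simplified for
# illustration and do not reproduce the study's analysis.
FEATURES = {
    "p": {"place": "labial",  "manner": "stop",      "voice": False},
    "b": {"place": "labial",  "manner": "stop",      "voice": True},
    "t": {"place": "coronal", "manner": "stop",      "voice": False},
    "d": {"place": "coronal", "manner": "stop",      "voice": True},
    "k": {"place": "dorsal",  "manner": "stop",      "voice": False},
    "s": {"place": "coronal", "manner": "fricative", "voice": False},
    "n": {"place": "coronal", "manner": "nasal",     "voice": True},
}

def mismatches(c1, c2):
    """Return the feature dimensions on which two consonants disagree."""
    f1, f2 = FEATURES[c1], FEATURES[c2]
    return [dim for dim in ("place", "manner", "voice") if f1[dim] != f2[dim]]

# Place-only mismatches (e.g. t~k) are reported as the most acceptable pattern,
# while any voicing mismatch (e.g. t~d) is strongly avoided.
for pair in [("t", "k"), ("t", "d"), ("s", "n")]:
    print(pair, mismatches(*pair))
```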