A Comparative Analysis of Pretrained Language Models for Text-to-Speech
State-of-the-art text-to-speech (TTS) systems have utilized pretrained
language models (PLMs) to enhance prosody and create more natural-sounding
speech. However, while PLMs have been extensively researched for natural
language understanding (NLU), their impact on TTS has been overlooked. In this
study, we aim to address this gap by conducting a comparative analysis of
different PLMs for two TTS tasks: prosody prediction and pause prediction.
Firstly, we trained a prosody prediction model using 15 different PLMs. Our
findings revealed a logarithmic relationship between model size and quality, as
well as significant performance differences between neutral and expressive
prosody. Secondly, we employed PLMs for pause prediction and found that the
task was less sensitive to small models. We also identified a strong
correlation between our empirical results and the GLUE scores obtained for
these language models. To the best of our knowledge, this is the first study of
its kind to investigate the impact of different PLMs on TTS.
Comment: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop
(SSW) in Grenoble, France, from 26th to 28th August 202
Common Premotor Regions for the Perception and Production of Prosody and Correlations with Empathy and Prosodic Ability
Background: Prosody, the melody and intonation of speech, involves the rhythm, rate, pitch and voice quality to relay linguistic and emotional information from one individual to another. A significant component of human social communication depends upon interpreting and responding to another person's prosodic tone as well as one's own ability to produce prosodic speech. However, there has been little work on whether the perception and production of prosody share common neural processes, and if so, how these might correlate with individual differences in social ability. Methods: The aim of the present study was to determine the degree to which perception and production of prosody rely on shared neural systems. Using fMRI, neural activity during perception and production of a meaningless phrase in different prosodic intonations was measured. Regions of overlap for production and perception of prosody were found in premotor regions, in particular the left inferior frontal gyrus (IFG). Activity in these regions was further found to correlate with how high an individual scored on two different measures of affective empathy as well as a measure of prosodic production ability. Conclusions: These data indicate, for the first time, that areas that are important for prosody production may also be utilized for prosody perception, as well as other aspects of social communication and social understanding, such as aspect
Speech synthesis, Speech simulation and speech science
Speech synthesis research has been transformed in recent years through the exploitation of speech corpora - both for statistical modelling and as a source of signals for concatenative synthesis. This revolution in methodology and the new techniques it brings call into question the received wisdom that better computer voice output will come from a better understanding of how humans produce speech. This paper discusses the relationship between this new technology of simulated speech and the traditional aims of speech science. The paper suggests that the goal of speech simulation frees engineers from inadequate linguistic and physiological descriptions of speech. But at the same time, it leaves speech scientists free to return to their proper goal of building a computational model of human speech production
Universal and language-specific processing : the case of prosody
A key question in the science of language is how speech processing can be influenced by both language-universal and language-specific mechanisms (Cutler, Klein, & Levinson, 2005). My graduate research aimed to address this question by adopting a crosslanguage approach to compare languages with different phonological systems. Of all components of linguistic structure, prosody is often considered to be one of the most language-specific dimensions of speech. This can have significant implications for our understanding of language use, because much of speech processing is specifically tailored to the structure and requirements of the native language. However, it is still unclear whether prosody may also play a universal role across languages, and very few comparative attempts have been made to explore this possibility. In this thesis, I examined both the production and perception of prosodic cues to prominence and phrasing in native speakers of English and Mandarin Chinese. In focus production, our research revealed that English and Mandarin speakers were alike in how they used prosody to encode prominence, but there were also systematic language-specific differences in the exact degree to which they enhanced the different prosodic cues (Chapter 2). This, however, was not the case in focus perception, where English and Mandarin listeners were alike in the degree to which they used prosody to predict upcoming prominence, even though the precise cues in the preceding prosody could differ (Chapter 3). Further experiments examining prosodic focus prediction in the speech of different talkers have demonstrated functional cue equivalence in prosodic focus detection (Chapter 4). Likewise, our experiments have also revealed both crosslanguage similarities and differences in the production and perception of juncture cues (Chapter 5). Overall, prosodic processing is the result of a complex but subtle interplay of universal and language-specific structure
Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation
We present a probabilistic model that uses both prosodic and lexical cues for
the automatic segmentation of speech into topically coherent units. We propose
two methods for combining lexical and prosodic information using hidden Markov
models and decision trees. Lexical information is obtained from a speech
recognizer, and prosodic features are extracted automatically from speech
waveforms. We evaluate our approach on the Broadcast News corpus, using the
DARPA-TDT evaluation metrics. Results show that the prosodic model alone is
competitive with word-based segmentation methods. Furthermore, we achieve a
significant reduction in error by combining the prosodic and word-based
knowledge sources.
Comment: 27 pages, 8 figures
Prosodic Event Recognition using Convolutional Neural Networks with Context Information
This paper demonstrates the potential of convolutional neural networks (CNN)
for detecting and classifying prosodic events on words, specifically pitch
accents and phrase boundary tones, from frame-based acoustic features. Typical
approaches use not only feature representations of the word in question but
also its surrounding context. We show that adding position features indicating
the current word benefits the CNN. In addition, this paper discusses the
generalization from a speaker-dependent modelling approach to a
speaker-independent setup. The proposed method is simple and efficient and
yields strong results not only in speaker-dependent but also
speaker-independent cases.
Comment: Interspeech 2017, 4 pages, 1 figure