Including Pitch Accent Optionality in Unit Selection Text-to-Speech Synthesis
A significant variability in pitch accent placement is found when comparing the patterns of prosodic prominence realized by different English speakers reading the same sentences. In this paper we describe a simple approach to incorporate this variability to synthesize prosodic prominence in unit selection text-to-speech synthesis. The main motivation of our approach is that by taking into account the variability of accent placements we enlarge the set of prosodically acceptable speech units, thus increasing the chances of selecting a good quality sequence of units, both in prosodic and segmental terms. Results on a large scale perceptual test show the benefits of our approach and indicate directions for further improvements.
Index Terms: speech synthesis, unit selection, prosodic prominence, pitch accent
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
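The abstract above mentions a probabilistic combination of prosodic and lexical information but does not say which form it takes. As a minimal sketch, assuming per-word boundary posteriors from each model are already available, one simple option is linear interpolation followed by a threshold. The function names, the weight, and the threshold are all hypothetical and are not taken from the paper.

```python
def combine_posteriors(p_prosody, p_lexical, weight=0.5):
    """Linearly interpolate two boundary posteriors (assumed combination rule)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_prosody + (1.0 - weight) * p_lexical


def segment(words, p_prosody, p_lexical, weight=0.5, threshold=0.5):
    """Split a word sequence wherever the combined boundary posterior
    reaches the threshold. Each posterior refers to the boundary that
    follows the corresponding word."""
    sentences, current = [], []
    for word, pp, pl in zip(words, p_prosody, p_lexical):
        current.append(word)
        if combine_posteriors(pp, pl, weight) >= threshold:
            sentences.append(current)
            current = []
    if current:  # keep any trailing words that had no closing boundary
        sentences.append(current)
    return sentences
```

In practice the weight would be tuned on held-out data, since the abstract notes that cue usage depends on the task and corpus.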
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech
synthesis directly from text. The system is composed of a recurrent
sequence-to-sequence feature prediction network that maps character embeddings
to mel-scale spectrograms, followed by a modified WaveNet model acting as a
vocoder to synthesize time-domain waveforms from those spectrograms. Our model
achieves a mean opinion score (MOS) comparable to that of
professionally recorded speech. To validate our design choices, we present
ablation studies of key components of our system and evaluate the impact of
using mel spectrograms as the input to WaveNet instead of linguistic, duration,
and F0 features. We further demonstrate that using a compact acoustic
intermediate representation enables significant simplification of the WaveNet
architecture.
Comment: Accepted to ICASSP 201
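For readers unfamiliar with the mel scale used for the intermediate spectrograms, the sketch below shows one widely used Hz-to-mel mapping and its inverse. The abstract does not state which mel variant the system uses, so the formula here (2595 * log10(1 + f / 700)) is an assumption chosen for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using a common log-based formula
    (an assumed variant; other definitions of the mel scale exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

Under this mapping, equal steps in mels correspond to progressively wider steps in Hz, which is why a mel spectrogram allots more resolution to low frequencies than to high ones.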
Recognizing Uncertainty in Speech
We address the problem of inferring a speaker's level of certainty based on
prosodic information in the speech signal, which has application in
speech-based dialogue systems. We show that using phrase-level prosodic
features centered around the phrases causing uncertainty, in addition to
utterance-level prosodic features, improves our model's level of certainty
classification. In addition, our models can be used to predict which phrase a
person is uncertain about. These results rely on a novel method for eliciting
utterances of varying levels of certainty that allows us to compare the utility
of contextually-based feature sets. We elicit level of certainty ratings from
both the speakers themselves and a panel of listeners, finding that there is
often a mismatch between speakers' internal states and their perceived states,
and highlighting the importance of this distinction.
Comment: 11 page