Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations
In this paper, we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge, this is the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail, and train a number of different models, ranging from feature-based classifiers to neural network systems, for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally, we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models are publicly available. Peer reviewed.
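The abstract above mentions predicting *discretized* prosodic prominence. The paper's exact label scheme is not given here, so as an illustrative sketch only, this shows one plausible way continuous prominence scores could be binned into the discrete labels a classifier would predict; the three-class scheme and thresholds are assumptions, not the paper's method.

```python
# Hypothetical sketch: binning continuous prominence scores into discrete
# labels. The class scheme and thresholds below are illustrative assumptions.

def discretize_prominence(score, thresholds=(0.33, 0.66)):
    """Map a continuous prominence score in [0, 1] to a discrete label:
    0 = non-prominent, 1 = weakly prominent, 2 = strongly prominent."""
    low, high = thresholds
    if score < low:
        return 0
    if score < high:
        return 1
    return 2

scores = [0.05, 0.5, 0.9]
print([discretize_prominence(s) for s in scores])  # [0, 1, 2]
```

A classifier (feature-based or built on BERT embeddings, as the abstract describes) would then be trained to predict these labels per word.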
DPP-TTS: Diversifying prosodic features of speech via determinantal point processes
With the rapid advancement of deep generative models, recent neural
Text-To-Speech (TTS) models have succeeded in synthesizing human-like speech.
There have been some efforts to generate speech with varied prosody beyond
monotonous prosody patterns. However, previous works have several limitations.
First, typical TTS models depend on the scaled sampling temperature for
boosting the diversity of prosody. Speech samples generated at high sampling
temperatures often lack perceptual prosodic diversity, which can adversely
affect the naturalness of the speech. Second, the diversity among samples is
neglected since the sampling procedure often focuses on a single speech sample
rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech
model based on Determinantal Point Processes (DPPs) with a prosody diversifying
module. Our TTS model is capable of generating speech samples that
simultaneously consider perceptual diversity in each sample and among multiple
samples. We demonstrate that DPP-TTS generates speech samples with more
diversified prosody than baselines in the side-by-side comparison test
considering the naturalness of speech at the same time.Comment: EMNLP 202
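The abstract attributes the diversity among samples to a determinantal point process. As a minimal sketch of the underlying mechanism (not the paper's implementation), the following shows greedy MAP inference for a DPP: given a positive-definite kernel whose entries encode pairwise similarity, greedily growing the subset that maximizes the kernel submatrix determinant favors mutually dissimilar items.

```python
# Illustrative sketch of DPP-style diverse selection, not DPP-TTS itself.
# L is a similarity kernel over candidate samples; det(L_S) is large when
# the selected subset S is diverse, so greedy maximization avoids duplicates.

def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * m[0][j] * det(minor)
    return total

def submatrix(L, idx):
    return [[L[i][j] for j in idx] for i in idx]

def greedy_dpp(L, k):
    """Greedily select k indices approximately maximizing det(L_S)."""
    selected = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(L)):
            if i in selected:
                continue
            d = det(submatrix(L, selected + [i]))
            if d > best_det:
                best, best_det = i, d
        selected.append(best)
    return selected

# Three candidate prosody renditions: items 0 and 1 are near-duplicates,
# item 2 is distinct, so the DPP prefers the diverse pair {0, 2}.
L = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(greedy_dpp(L, 2))  # [0, 2]
```

In a TTS setting, the kernel entries would be computed from prosodic features of candidate speech samples, so that the selected set exhibits perceptually diverse prosody.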
Quantifying the redundancy between prosody and text
Prosody -- the suprasegmental component of speech, including pitch, loudness,
and tempo -- carries critical aspects of meaning. However, the relationship
between the information conveyed by prosody vs. by the words themselves remains
poorly understood. We use large language models (LLMs) to estimate how much
information is redundant between prosody and the words themselves. Using a
large spoken corpus of English audiobooks, we extract prosodic features aligned
to individual words and test how well they can be predicted from LLM
embeddings, compared to non-contextual word embeddings. We find a high degree
of redundancy between the information carried by the words and prosodic
information across several prosodic features, including intensity, duration,
pauses, and pitch contours. Furthermore, a word's prosodic information is
redundant with both the word itself and the context preceding as well as
following it. Still, we observe that prosodic features cannot be fully
predicted from text, suggesting that prosody carries information above and
beyond the words. Along with this paper, we release a general-purpose data
processing pipeline for quantifying the relationship between linguistic
information and extra-linguistic features.Comment: Published at The 2023 Conference on Empirical Methods in Natural
Language Processing (EMNLP
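The abstract tests how well prosodic features can be predicted from embeddings. One standard way to score such a predictor is variance explained (R²); the gap between a contextual and a non-contextual predictor then indicates how much of the redundancy comes from context. The sketch below is generic and illustrative, with made-up toy values; it is not the paper's released pipeline.

```python
# Hedged sketch: scoring how well a prosodic feature (e.g. word duration)
# is predicted, via variance explained (R^2). Data values are illustrative.

def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot

# Toy word durations (seconds) and predictions from two hypothetical models:
# a contextual predictor that tracks the data, and a static one that can
# only output the mean duration.
durations = [0.20, 0.35, 0.15, 0.40, 0.30]
contextual_pred = [0.22, 0.33, 0.17, 0.38, 0.29]
static_pred = [0.28, 0.28, 0.28, 0.28, 0.28]

print(round(r_squared(durations, contextual_pred), 2))
print(round(r_squared(durations, static_pred), 2))
```

A high R² from contextual embeddings alongside a lower R² from non-contextual ones would indicate, as the abstract reports, that much of the prosodic information is redundant with the surrounding text, while any residual unexplained variance points to information prosody carries beyond the words.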