Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations
In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge, this is the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark in detail, and train a number of models, ranging from feature-based classifiers to neural network systems, for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models, even with less than 10% of the training data. Finally, we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and the methods for predicting prosodic prominence from text. The dataset and the code for the models are publicly available.
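The abstract above contrasts feature-based classifiers with contextual-embedding models for discretized prominence prediction. As a minimal sketch of the feature-based baseline family it mentions, the following trains a logistic regression on two hypothetical word-level features (normalized length and rarity); the data is synthetic and the feature choice is an illustrative assumption, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word-level features for a binary prominence classifier:
# column 0 ~ normalized word length, column 1 ~ negative log frequency.
# Labels are simulated so that longer, rarer words tend to be prominent.
n = 2000
X = rng.normal(size=(n, 2))
true_logits = 1.5 * X[:, 0] + 1.0 * X[:, 1]
y = (true_logits + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Logistic regression fit by plain gradient descent on the log-loss.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(prominent)
    w -= 0.5 * (X.T @ (p - y) / n)           # gradient step on weights
    b -= 0.5 * np.mean(p - y)                # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == (y == 1))
```

A contextual-embedding model would replace the two hand-crafted features with per-token BERT vectors; the training loop itself is unchanged, which is why such baselines are a useful point of comparison.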
Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis
Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible to end-to-end systems with a single sentence as their synthesis domain. One possible solution is to condition the synthesized speech on explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels, capturing word-level prominence and phrasal boundary strength, can result in more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as input to a Tacotron-like synthesis system alongside textual information. Objective evaluation of the synthesized speech shows that using the prosodic labels significantly improves the output in terms of the faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.
Quantifying the redundancy between prosody and text
Prosody -- the suprasegmental component of speech, including pitch, loudness,
and tempo -- carries critical aspects of meaning. However, the relationship
between the information conveyed by prosody vs. by the words themselves remains
poorly understood. We use large language models (LLMs) to estimate how much
information is redundant between prosody and the words themselves. Using a
large spoken corpus of English audiobooks, we extract prosodic features aligned
to individual words and test how well they can be predicted from LLM
embeddings, compared to non-contextual word embeddings. We find a high degree
of redundancy between the information carried by the words and prosodic
information across several prosodic features, including intensity, duration,
pauses, and pitch contours. Furthermore, a word's prosodic information is
redundant with both the word itself and the context preceding as well as
following it. Still, we observe that prosodic features cannot be fully
predicted from text, suggesting that prosody carries information above and
beyond the words. Along with this paper, we release a general-purpose data
processing pipeline for quantifying the relationship between linguistic
information and extra-linguistic features. Published at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
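The redundancy estimate described above amounts to asking how much better a prosodic feature can be predicted from contextual embeddings than from non-contextual ones. The sketch below illustrates that comparison with closed-form ridge regression on purely synthetic data; the "contextual" and "static" embedding matrices and the duration-like target are simulated stand-ins, not the paper's corpus or pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_r2(X_tr, y_tr, X_te, y_te, alpha=1.0):
    """Closed-form ridge regression; returns held-out R^2."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ y_tr)
    pred = X_te @ w
    ss_res = np.sum((y_te - pred) ** 2)
    ss_tot = np.sum((y_te - y_te.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Simulated embeddings: the "contextual" vectors carry the prosodic
# signal; the "static" vectors are a noisier view of the same words.
n, d = 3000, 16
X_ctx = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X_ctx @ beta + rng.normal(scale=1.0, size=n)       # e.g. word duration
X_static = X_ctx + rng.normal(scale=2.0, size=(n, d))  # context washed out

split = 2000
r2_ctx = ridge_r2(X_ctx[:split], y[:split], X_ctx[split:], y[split:])
r2_static = ridge_r2(X_static[:split], y[:split], X_static[split:], y[split:])
# The gap r2_ctx - r2_static is a rough proxy for how much prosodic
# information is redundant with context beyond word identity alone.
```

In this simulation the contextual predictor recovers most of the variance while the static one does not; in the paper's framing, the residual variance that neither predictor explains is the prosodic information carried "above and beyond the words."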
DPP-TTS: Diversifying prosodic features of speech via determinantal point processes
With the rapid advancement in deep generative models, recent neural
Text-To-Speech (TTS) models have succeeded in synthesizing human-like speech.
There have been some efforts to generate speech with various prosody beyond
monotonous prosody patterns. However, previous works have several limitations.
First, typical TTS models depend on the scaled sampling temperature for
boosting the diversity of prosody. Speech samples generated at high sampling
temperatures often lack perceptual prosodic diversity, which can adversely
affect the naturalness of the speech. Second, the diversity among samples is
neglected since the sampling procedure often focuses on a single speech sample
rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech
model based on Determinantal Point Processes (DPPs) with a prosody diversifying
module. Our TTS model is capable of generating speech samples that
simultaneously consider perceptual diversity in each sample and among multiple
samples. We demonstrate that DPP-TTS generates speech samples with more
diversified prosody than baselines in side-by-side comparison tests, while
preserving the naturalness of the speech. Published at EMNLP.
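A determinantal point process scores a set of candidates by the determinant of a similarity kernel, so near-duplicate items suppress one another and diverse sets win. The sketch below shows greedy MAP selection under a DPP on toy "prosody feature" vectors; the kernel construction and the six candidate points are illustrative assumptions, not the DPP-TTS model itself.

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedy MAP inference for a DPP: repeatedly add the item that
    most increases det(L_S) over the selected set S."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best_i, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        selected.append(best_i)
    return selected

# Toy prosody-feature vectors for six candidate speech samples:
# two tight clusters of near-duplicate prosody patterns (hypothetical).
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
sq_dists = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
L = np.exp(-sq_dists)  # RBF similarity kernel

picked = greedy_dpp(L, 2)
# The DPP prefers diverse sets: the two picks land in different clusters,
# unlike independent sampling, which often draws near-duplicates.
```

This is the core mechanism that lets a DPP-based sampler enforce diversity across samples rather than merely raising the sampling temperature within each one.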
The DeepZen Speech Synthesis System for Blizzard Challenge 2023
This paper describes the DeepZen text to speech (TTS) system for Blizzard
Challenge 2023. The goal of this challenge is to synthesise natural and
high-quality speech in French, from a large monospeaker dataset (hub task) and
from a smaller dataset by speaker adaptation (spoke task). We participated in
both tasks with the same model architecture. Our approach was to use an
auto-regressive model, which retains an advantage for generating natural
sounding speech but to improve prosodic control in several ways. Similarly to
non-attentive Tacotron, the model uses a duration predictor and Gaussian
upsampling at inference, but with simpler unsupervised training. We also
model the speaking style at both sentence and word levels by extracting global
and local style tokens from the reference speech. At inference, the global and
local style tokens are predicted from a BERT model run on text. This BERT model
is also used to predict specific pronunciation features like schwa elision and
optional liaisons. Finally, a modified version of HiFi-GAN, trained on a large
public dataset and fine-tuned on the target voices, is used to generate the
speech waveform. Our team is identified as O in the Blizzard evaluation, and
MUSHRA test results show that our system ranked second ex aequo in both the
hub task (median score of 0.75) and the spoke task (median score of 0.68), out
of 18 and 14 participants, respectively. Presented at the Blizzard Challenge 2023.
Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users
People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live-captioning (with a human transcriptionist) to access spoken information. However, such services are often not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. However, ASR systems are still imperfect, especially in realistic conversational settings, which raises issues of trust and acceptance of these systems within the DHH community. To overcome these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users.
The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems (in part 2 of this dissertation) or creating new applications for DHH users of captioned video (in part 3 of this dissertation). We found that models which consider both the acoustic properties of spoken words as well as text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features.
The second part of this dissertation describes studies to understand DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of various ASR systems, for determining the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric.
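The finding above is that weighting errors by word importance predicts caption usability better than plain WER. As a minimal sketch of that idea (not the dissertation's actual metric), the edit-distance variant below charges errors on a reference word in proportion to an importance score for that word; the example sentence and scores are hypothetical.

```python
def weighted_wer(ref, hyp, weights=None):
    """Word-level edit distance where errors on heavily weighted
    reference words cost more. weights[i] is the importance of ref[i];
    with weights=None this reduces to standard WER."""
    if weights is None:
        weights = [1.0] * len(ref)
    n, m = len(ref), len(hyp)
    # d[i][j]: min cost to align the first i ref words with j hyp words.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + weights[i - 1]   # delete ref word
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + 1.0              # insert spurious word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else weights[i - 1]
            d[i][j] = min(d[i - 1][j - 1] + sub,          # match/substitute
                          d[i - 1][j] + weights[i - 1],   # deletion
                          d[i][j - 1] + 1.0)              # insertion
    return d[n][m] / sum(weights)

ref = "the meeting moved to tuesday".split()
w = [0.1, 1.0, 1.0, 0.1, 1.0]  # content words weighted higher (hypothetical)
hyp1 = "a meeting moved to tuesday".split()     # error on a function word
hyp2 = "the meeting moved to thursday".split()  # error on a content word
```

Plain WER rates both hypotheses identically (one error in five words), while the weighted variant penalizes the error on the content word far more, matching DHH users' reported experience of which errors actually hurt understanding.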
The final part of this dissertation describes research on importance-based highlighting of words in captions, as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in captions so that readers can attend to the most important information quickly. Despite the known benefits of highlighting in static texts, the usefulness of highlighting in captions for DHH users remains largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions, and their preferences among different design configurations for highlighting. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions.
Prosodic prominence in English
In English, certain words are perceptually more salient than neighboring words. This perceptual salience is signaled by acoustic cues: prominent words are higher in pitch, longer, or louder than nonprominent words. Perceptual prominence is associated with the meaning of a word in its discourse context. Prominent words usually convey new or contrastive information, while nonprominent words convey given or noncontrastive information. This dissertation addresses English prominence in two separate studies. The first study investigates prosodic prominence in relation to pitch accents, acoustic cues, and the discourse meaning of a word in a public speech. The second study examines the cognitive representation of prosodic contour in a corpus of imitated speech.
Linguists claim that the information status of a word determines the type of pitch accent in English. Prior research informs us about prominence (1) in relation to the binary given-new distinction of lexical givenness, and (2) in minimally contextualized utterances such as question-answer prompts or excerpts from a corpus. The assignment of prominence, however, can vary in relation to the referential as well as lexical meaning of a word in natural, more contextualized speech. This study examines prosodic prominence as a function of pitch accents, acoustic cues, and information status in a complete public speech. Information status is considered in relation to referential and lexical givenness and to alternative-based contrastive focus. The results show that accent type is probabilistically associated with information status in this speech, and that accent assignment differs between referentially and lexically given words. Despite the weak relationship between information status and pitch accents in the speaker's production, non-expert listeners perceive prominence as expected: they are more likely to perceive prominence on words carrying new or contrastive information, or on words with high or bitonal pitch accents. Surprisingly, the listeners perceive acoustic cues differently depending on the information status or accent type of a word. Based on these results, the first study suggests that (1) the relationship between information status and accent type is not deterministic in English, (2) lexical givenness differs from referential givenness in the production and perception of prominence, and (3) perceived prominence is influenced by information status, pitch accents, acoustic cues, and their interaction.
The second study examines how an intonational contour is represented in the mental lexicon of English speakers. Some research finds that speakers can reproduce the phonetic details of intonational features, while other research finds that speakers reproduce abstract intonational features better than the phonetic details of an utterance. This study investigates the domain of intonational encoding by comparing several prosodic domains in imitated utterances. I hypothesize that the domain which best captures the similarity of intonational contour between the model speaker and imitators is the target of imitation, and can therefore be taken as the domain of intonational encoding in cognitive representation. The results show that the f0 distance between the model speaker and imitators is best explained over an intermediate phrase. Based on these results, the second study proposes that speakers encode a time-varying f0 contour over a prosodic phrase in their mental lexicon, supporting exemplar encoding of intonational contours.