    Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations

    In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge, this is the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark in detail, and we train a number of models, ranging from feature-based classifiers to neural network systems, for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models, even with less than 10% of the training data. Finally, we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and the methods for predicting prosodic prominence from text. The dataset and the code for the models are publicly available.
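
    The abstract does not give implementation details, but the core approach, treating word-level prominence prediction as token classification on top of a pre-trained BERT, can be sketched as follows. The label inventory, model checkpoint, and example sentence are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: word-level prominence prediction as BERT token
# classification. The 3-level discretization is an assumption.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

NUM_LEVELS = 3  # assumed: non-prominent / prominent / highly prominent
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_LEVELS)
# NOTE: the classification head is randomly initialized here; in practice
# it would first be fine-tuned on the prominence-labeled training data.

words = ["the", "cat", "sat", "on", "the", "mat"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, num_subtokens, NUM_LEVELS)

# Read off one prediction per word from the first subtoken of each word.
pred = logits.argmax(-1)[0]
seen = set()
for pos, wid in enumerate(enc.word_ids(0)):
    if wid is not None and wid not in seen:
        seen.add(wid)
        print(words[wid], int(pred[pos]))
```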

    Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

    Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, state-of-the-art systems fall short of faithfully reproducing the local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible to end-to-end systems whose synthesis domain is a single sentence. One possible solution is to condition the synthesized speech on explicit prosodic labels, potentially generated from longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels, capturing word-level prominence and phrasal boundary strength, results in more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract these labels from speech material, and use them as input to a Tacotron-like synthesis system alongside the textual information. Objective evaluation of the synthesized speech shows that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.
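
    The abstract leaves the label format open. One simple way to realize "textual input augmented with prosodic labels" is to interleave discrete prominence and boundary tags with the word sequence before it reaches the synthesizer's text front end. The two-feature tag scheme below is hypothetical, not the paper's exact encoding.

```python
# Hedged sketch: interleaving assumed word-level prosody tags with text.
def augment_with_prosody(words, prominence, boundary):
    """Insert <pN> prominence tags before words and <bN> boundary-strength
    tags after words, so a TTS text encoder can condition on them."""
    out = []
    for word, p, b in zip(words, prominence, boundary):
        out.append(f"<p{p}>")      # prominence level of this word
        out.append(word)
        if b > 0:
            out.append(f"<b{b}>")  # strength of the boundary after it
    return " ".join(out)

print(augment_with_prosody(
    ["a", "cup", "of", "tea"],
    [0, 2, 0, 1],    # e.g., wavelet-derived prominence levels
    [0, 0, 0, 2]))   # e.g., boundary strengths
# -> <p0> a <p2> cup <p0> of <p1> tea <b2>
```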

    Quantifying the redundancy between prosody and text

    Prosody -- the suprasegmental component of speech, including pitch, loudness, and tempo -- carries critical aspects of meaning. However, the relationship between the information conveyed by prosody and that conveyed by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words themselves. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word's prosodic information is redundant both with the word itself and with the context preceding and following it. Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features. Comment: Published at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
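
    In outline, the redundancy estimate amounts to regressing each word-aligned prosodic feature on word embeddings and comparing held-out predictive performance for contextual versus non-contextual representations. The sketch below uses ridge regression and random stand-in arrays; the paper's actual features, embeddings, and estimator may differ.

```python
# Hedged sketch: compare how well contextual vs. static embeddings predict
# a word-level prosodic feature. All arrays are random stand-ins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words = 2000
X_contextual = rng.normal(size=(n_words, 768))  # stand-in for LLM embeddings
X_static = rng.normal(size=(n_words, 300))      # stand-in for static embeddings
y = rng.normal(size=n_words)                    # stand-in feature, e.g. duration

for name, X in [("contextual", X_contextual), ("static", X_static)]:
    r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()
    # With real data, a higher R^2 for contextual embeddings indicates more
    # redundancy between context and prosody; here it is near zero by design.
    print(f"{name}: held-out R^2 = {r2:.3f}")
```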

    DPP-TTS: Diversifying prosodic features of speech via determinantal point processes

    With the rapid advancement of deep generative models, recent neural Text-To-Speech (TTS) models have succeeded in synthesizing human-like speech. There have been some efforts to generate speech with varied prosody beyond monotonous prosody patterns. However, previous works have several limitations. First, typical TTS models depend on a scaled sampling temperature to boost the diversity of prosody. Speech samples generated at high sampling temperatures often lack perceptual prosodic diversity, which can adversely affect the naturalness of the speech. Second, diversity among samples is neglected, since the sampling procedure often focuses on a single speech sample rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech model based on Determinantal Point Processes (DPPs) with a prosody-diversifying module. Our TTS model is capable of generating speech samples that simultaneously consider perceptual diversity within each sample and among multiple samples. We demonstrate that DPP-TTS generates speech samples with more diversified prosody than baselines in side-by-side comparison tests, while also taking the naturalness of the speech into account. Comment: EMNLP 2023.
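
    The defining property of a DPP is that subsets of mutually similar items receive low probability, so selection favors diverse subsets. A minimal greedy MAP sketch over hypothetical candidate prosody samples, with an assumed inner-product kernel rather than the paper's learned one, might look like this:

```python
# Hedged sketch: greedy MAP selection from an L-ensemble DPP, used to pick
# a diverse subset of candidates. Kernel and features are assumptions.
import numpy as np

def greedy_dpp_select(features, k):
    """Greedily grow a subset maximizing det(L[S, S]) for L = F F^T with
    row-normalized features F, so similar candidates repel each other."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    L = F @ F.T
    selected, remaining = [], list(range(len(F)))
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
        remaining.remove(best)
    return selected

candidates = np.random.default_rng(1).normal(size=(10, 16))  # 10 candidates
print(greedy_dpp_select(candidates, k=3))  # 3 mutually dissimilar indices
```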

    The DeepZen Speech Synthesis System for Blizzard Challenge 2023

    This paper describes the DeepZen text-to-speech (TTS) system for the Blizzard Challenge 2023. The goal of this challenge is to synthesise natural and high-quality speech in French, from a large single-speaker dataset (hub task) and from a smaller dataset by speaker adaptation (spoke task). We participated in both tasks with the same model architecture. Our approach was to use an auto-regressive model, which retains an advantage for generating natural-sounding speech, while improving prosodic control in several ways. Similarly to non-attentive Tacotron, the model uses a duration predictor and Gaussian upsampling at inference, but with simpler, unsupervised training. We also model speaking style at both the sentence and word levels by extracting global and local style tokens from the reference speech. At inference, the global and local style tokens are predicted by a BERT model run on the text. This BERT model is also used to predict specific pronunciation features such as schwa elision and optional liaisons. Finally, a modified version of HifiGAN, trained on a large public dataset and fine-tuned on the target voices, is used to generate the speech waveform. Our team is identified as O in the Blizzard evaluation, and MUSHRA test results show that our system ranks second ex aequo in both the hub task (median score of 0.75) and the spoke task (median score of 0.68), among 18 and 14 participants, respectively. Comment: Blizzard Challenge 2023.
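
    Gaussian upsampling, introduced with non-attentive Tacotron, replaces attention with duration-driven weights: each output frame is a Gaussian-weighted mixture of encoder states centered on the midpoints implied by the predicted durations. A minimal sketch, with a fixed width parameter assumed rather than the system's exact parameterization:

```python
# Hedged sketch of Gaussian upsampling: expand per-token encoder states to
# frame rate using predicted durations. Fixed sigma is a simplification.
import torch

def gaussian_upsample(encoder_states, durations, sigma=1.0):
    """encoder_states: (T_in, D); durations: (T_in,) in frames -> (T_out, D)."""
    ends = torch.cumsum(durations, dim=0)     # end frame of each token
    centers = ends - durations / 2.0          # token midpoints in frames
    t = torch.arange(int(ends[-1].item()), dtype=torch.float32) + 0.5
    # (T_out, T_in): Gaussian weight of each output frame over input tokens
    w = torch.exp(-0.5 * ((t[:, None] - centers[None, :]) / sigma) ** 2)
    w = w / w.sum(dim=1, keepdim=True)        # normalize weights per frame
    return w @ encoder_states

states = torch.randn(4, 8)                    # 4 phone states, dim 8
durs = torch.tensor([3.0, 5.0, 2.0, 4.0])     # predicted durations (frames)
print(gaussian_upsample(states, durs).shape)  # torch.Size([14, 8])
```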

    Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

    People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live captioning (with a human transcriptionist) to access spoken information. However, such services are not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. But ASR systems are still not perfect, especially in realistic conversational settings, leading to issues of trust in and acceptance of these systems within the DHH community. To overcome these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users.

    The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems (part 2 of this dissertation) or in creating new applications for DHH users of captioned video (part 3 of this dissertation). We found that models that consider both the acoustic properties of spoken words and text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features.

    The second part of this dissertation describes studies to understand DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of various ASR systems and help determine the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric.

    The final part of this dissertation describes research on importance-based highlighting of words in captions as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in a caption to enable readers to attend to the most important bits of information quickly. Despite the known benefits of highlighting in static texts, research on the usefulness of highlighting in captions for DHH users is largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions and their preferences among different design configurations for highlighting. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions.
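
    The finding that importance-aware metrics beat plain WER suggests a simple family of metrics: a word-level edit distance in which each error is scaled by the importance of the reference word it affects. The costs and weighting below are illustrative assumptions, not the dissertation's exact metric.

```python
# Hedged sketch: an importance-weighted variant of word error rate.
def weighted_wer(ref, hyp, importance):
    """ref, hyp: word lists; importance: per-reference-word weight in [0, 1].
    Substitutions/deletions cost the affected word's importance;
    insertions cost a flat 1.0 (an arbitrary modeling choice here)."""
    n, m = len(ref), len(hyp)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + importance[i - 1]       # delete ref word
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + 1.0                     # insert hyp word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (
                0.0 if ref[i - 1] == hyp[j - 1] else importance[i - 1])
            dp[i][j] = min(sub,
                           dp[i - 1][j] + importance[i - 1],  # deletion
                           dp[i][j - 1] + 1.0)                # insertion
    return dp[n][m] / max(sum(importance), 1e-9)

ref = ["the", "meeting", "moved", "to", "friday"]
hyp = ["the", "meeting", "moved", "to", "monday"]
imp = [0.1, 0.8, 0.6, 0.1, 0.9]  # hypothetical importance scores
print(round(weighted_wer(ref, hyp, imp), 3))  # 0.36: one high-importance error
```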

    Prosodic prominence in English

    In English, certain words are perceptually more salient than neighboring words. This perceptual salience is signaled by acoustic cues: prominent words are higher, longer, or louder than nonprominent words. Perceptual prominence is associated with the meaning of a word in its discourse context: prominent words usually convey new or contrastive information, while nonprominent words convey given or noncontrastive information. This dissertation addresses English prominence in two separate studies. The first study investigates prosodic prominence in relation to pitch accents, acoustic cues, and the discourse meaning of a word in a public speech. The second study examines the cognitive representation of prosodic contours in a corpus of imitation.

    Linguists claim that the information status of a word determines the type of pitch accent it receives in English. Prior research informs us about prominence (1) in relation to the binary given-new distinction of lexical givenness, and (2) in minimally contextualized utterances such as question-answer prompts or excerpts from a corpus. The assignment of prominence, however, can vary in relation to the referential as well as the lexical meaning of a word in natural, more contextualized speech. The first study therefore examines prosodic prominence as a function of pitch accents, acoustic cues, and information status in a complete public speech. Information status is considered in relation to referential and lexical givenness and to alternative-based contrastive focus. The results show that accent type is probabilistically associated with information status in this speech, and that accent assignment differs between referentially and lexically given words. Despite the weak relationship between information status and pitch accents in the speaker's production, non-expert listeners perceive prominence as expected: they are more likely to perceive prominence on words carrying new or contrastive information, or on words with high or bitonal pitch accents. Surprisingly, the listeners weigh acoustic cues differently depending on the information status or accent type of a word. Based on these results, the first study suggests that (1) the relationship between information status and accent type is not deterministic in English, (2) lexical givenness differs from referential givenness in the production and perception of prominence, and (3) perceived prominence is influenced by information status, pitch accents, acoustic cues, and their interaction.

    The second study examines how an intonational contour is represented in the mental lexicon of English speakers. Some linguists find that speakers are able to reproduce the phonetic details of intonational features, while other research finds that speakers are better at reproducing intonational features than at imitating the phonetic details of an utterance. This study investigates the domain of intonational encoding by comparing several prosodic domains in imitated utterances. I hypothesize that the domain which best captures the similarity of intonational contour between the model speaker and the imitators is the target of imitation, and that this domain can be considered the domain of intonational encoding in cognitive representation. The results show that the f0 distance between the model speaker and the imitators is best explained over an intermediate phrase. Based on these results, the second study proposes that speakers encode a time-varying f0 contour over a prosodic phrase in their mental lexicon, supporting exemplar encoding of intonational contours.
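
    The second study's key quantity is an f0 distance between the model speaker's contour and an imitator's, computed over candidate prosodic domains. One common way to make contours of unequal length comparable is to time-normalize each domain and take an RMSE, as in this hedged sketch; the dissertation's actual distance measure may differ.

```python
# Hedged sketch: time-normalized RMSE between two f0 contours over a
# prosodic domain (word, intermediate phrase, ...). Resampling is assumed.
import numpy as np

def f0_distance(f0_model, f0_imitator, n_points=50):
    """Resample both contours to a common normalized time axis, then RMSE."""
    grid = np.linspace(0.0, 1.0, n_points)
    a = np.interp(grid, np.linspace(0.0, 1.0, len(f0_model)), f0_model)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(f0_imitator)), f0_imitator)
    return float(np.sqrt(np.mean((a - b) ** 2)))

model = np.array([180.0, 210.0, 240.0, 220.0, 190.0, 170.0])  # Hz, toy contour
imitation = np.array([175.0, 205.0, 235.0, 230.0, 185.0])
print(round(f0_distance(model, imitation), 2))
```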