
    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models, for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200

    The Perception of Creaky Voice: Does Speaker Gender Affect our Judgments?

    This study focuses on the phonetic saliency of creaky voice and the perceptual sociolinguistic indexes evoked by its use. It consists of two experiments: a listener judgment task using a Likert scale, and an AXB task. In the first experiment, listeners heard modal and creaky voice statement-of-fact tokens and rated the speaker on six characteristics (intelligent, feminine, educated, masculine, hesitant, and confident). Both male and female speakers were rated as less intelligent, less educated, less feminine, more masculine, less confident, and more hesitant when using creaky voice phonation than when using the modal register. Participants' ratings of male and female speakers also differed statistically. In the second experiment, participants listened to continua ranging from modal register to extreme creaky voice (based on F0 levels) and performed an AXB task to determine how well they could distinguish levels of creaky voice along the continuum. Participants were less able to correctly detect the level of creaky voice in the female speaker than in the male speaker for the lower half of the continuum.

    DeepFry: Identifying Vocal Fry Using Deep Neural Networks

    Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully-connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one is tailored for the task, and the other is based on a state-of-the-art unsupervised representation. Results suggest our best-performing system has improved recall and F1 scores compared to previous methods on unseen data. Comment: under submission to Interspeech 202

    HMM-based synthesis of creaky voice

    Creaky voice, also referred to as vocal fry, is a voice quality frequently produced in many languages, in both read and conversational speech. To enhance naturalness, speech synthesis systems should be able to generate speech in all its expressive diversity, including creaky voice. The present study looks to exploit our recent developments, including creaky voice detection, prediction of creaky voice from context, and rendering of the creaky excitation, in a fully functioning and automatic HMM-based synthesis system. HMM-based synthetic creaky voices are built and evaluated in subjective listening tests, which show that the best synthetic creaky voices are rated more natural and more creaky than a conventional voice. A non-creaky voice is also successfully transformed to use creak by modifying the F0 contour and excitation of the predicted creaky parts. The transformed voice is rated equal in naturalness to the original voice and clearly more creaky. Index Terms: speech synthesis, creaky voice, contextual factors, F0 estimation, excitation modeling

    Acoustic and linguistic interdependencies of irregular phonation

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 57-58). Irregular phonation is a commonly occurring but only partially understood phenomenon of human speech production. We know properties of irregular phonation can be clues to a speaker's dialect and even identity. We also have evidence that irregular phonation is used as a signal of linguistic and acoustic intent. Nonetheless, there remain fundamental questions about the nature of irregular phonation and the interdependencies of irregular phonation with acoustic and linguistic speech characteristics, as well as the implications of this relationship for speech processing applications. In this thesis, we hypothesize that irregular phonation occurs naturally in situations with large amounts of change in pitch or power. We therefore focus on investigating parameters such as pitch variance and power variance as well as other measurable properties involving speech dynamics. In this work, we have investigated the frequency and structure of irregular phonation, the acoustic characteristics of the TIMIT Acoustic-Phonetic Speech Corpus, and relationships between these two groups. We show that characteristics of irregular phonation are positively correlated with several of our potential predictors including pitch and power variance. Finally, we demonstrate that these correlations lead to a model with the potential to predict the occurrence and properties of irregular phonation. By Kimberly F. Dietz. M.Eng