Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
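The abstract above does not spell out how the probabilistic combination of prosodic and lexical information is computed. As a minimal sketch, assuming simple linear interpolation of per-word-boundary posteriors (one possible scheme, not necessarily the authors' method), the idea can be illustrated as follows. The function names, the `weight` parameter, and the `threshold` value are hypothetical.

```python
# Sketch only: linear interpolation of boundary posteriors from two models.
# Assumption: each model yields P(boundary) for every word boundary in the input.

def combine_posteriors(p_lexical, p_prosodic, weight=0.5):
    """Interpolate lexical and prosodic boundary posteriors.

    weight is the share given to the lexical model (hypothetical parameter);
    the prosodic model receives 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(p_lexical) != len(p_prosodic):
        raise ValueError("both models must score the same boundaries")
    return [
        weight * pl + (1.0 - weight) * pp
        for pl, pp in zip(p_lexical, p_prosodic)
    ]


def mark_boundaries(posteriors, threshold=0.5):
    """Label a position as a sentence/topic boundary when its posterior
    reaches the (hypothetical) decision threshold."""
    return [p >= threshold for p in posteriors]
```

In a scheme like this the interpolation weight would normally be tuned on held-out data, so that neither knowledge source is trusted more than its measured accuracy warrants.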
The Perception of Creaky Voice: Does Speaker Gender Affect our Judgments?
This study focuses on the phonetics of creaky voice salience and the sociolinguistic indexes that listeners perceive when creaky voice is used. It consists of two experiments: the first a listener-judgment task using Likert scales, the second an AXB task. The first experiment used modal and creaky voice statement-of-fact tokens to determine whether listeners judged the speaker to have a given characteristic (intelligent, feminine, educated, masculine, hesitant, or confident). Both male and female speakers were judged to be less intelligent, less educated, less feminine, more masculine, less confident, and more hesitant when using creaky voice phonation than when using the modal register. Participants' ratings of male and female speakers also differed significantly. In the second experiment, participants listened to continua ranging from modal register to extreme creaky voice (based on F0 levels) and performed an AXB task measuring their ability to distinguish levels of creaky voice along the continuum. Participants were less able to correctly detect the level of creaky voice in the female speaker than in the male speaker for the lower half of the continuum.
DeepFry: Identifying Vocal Fry Using Deep Neural Networks
Vocal fry or creaky voice refers to a voice quality characterized by
irregular glottal opening and low pitch. It occurs in diverse languages and is
prevalent in American English, where it is used not only to mark phrase
finality, but also sociolinguistic factors and affect. Due to its irregular
periodicity, creaky voice challenges automatic speech processing and
recognition systems, particularly for languages where creak is frequently used.
This paper proposes a deep learning model to detect creaky voice in fluent
speech. The model is composed of an encoder and a classifier trained together.
The encoder takes the raw waveform and learns a representation using a
convolutional neural network. The classifier is implemented as a multi-headed
fully-connected network trained to detect creaky voice, voicing, and pitch,
where the last two are used to refine creak prediction. The model is trained
and tested on speech of American English speakers, annotated for creak by
trained phoneticians.
We evaluated the performance of our system using two encoders: one is
tailored for the task, and the other is based on a state-of-the-art
unsupervised representation. Results suggest our best-performing system has
improved recall and F1 scores compared to previous methods on unseen data.
Comment: under submission to Interspeech 202
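For readers unfamiliar with the metrics reported above, recall and F1 for a binary creak detector can be computed from frame-level labels. The sketch below uses the standard definitions; treating each frame as one decision and the function name `recall_and_f1` are assumptions made for illustration, not details taken from the paper.

```python
# Sketch only: recall and F1 for binary frame labels (1 = creak, 0 = no creak).

def recall_and_f1(reference, predicted):
    """Return (recall, f1) for two equal-length sequences of 0/1 labels."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r and p)
    false_pos = sum(1 for r, p in pairs if not r and p)
    false_neg = sum(1 for r, p in pairs if r and not p)

    # Guard against empty denominators (no predicted or no reference creak).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0.0:
        return recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1
```

Because creaky frames are usually a small minority of all frames, recall and F1 are more informative here than plain accuracy, which a detector could maximize by never predicting creak.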
HMM-based synthesis of creaky voice
Creaky voice, also referred to as vocal fry, is a voice quality frequently produced in many languages, in both read and conversational speech. To sound natural, speech synthesis systems should be able to generate speech in all its expressive diversity, including creaky voice. The present study integrates our recent developments, including creaky voice detection, prediction of creaky voice from context, and rendering of the creaky excitation, into a fully functioning and automatic HMM-based synthesis system. HMM-based synthetic creaky voices are built and evaluated in subjective listening tests, which show that the best synthetic creaky voices are rated more natural and more creaky than a conventional voice. A non-creaky voice is also successfully transformed to use creak by modifying the F0 contour and excitation of the predicted creaky parts. The transformed voice is rated equal in naturalness to the original voice and clearly more creaky.
Index Terms: speech synthesis, creaky voice, contextual factors, F0 estimation, excitation modeling
Acoustic and linguistic interdependencies of irregular phonation
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 57-58).
Irregular phonation is a commonly occurring but only partially understood phenomenon of human speech production. We know properties of irregular phonation can be clues to a speaker's dialect and even identity. We also have evidence that irregular phonation is used as a signal of linguistic and acoustic intent. Nonetheless, there remain fundamental questions about the nature of irregular phonation and the interdependencies of irregular phonation with acoustic and linguistic speech characteristics, as well as the implications of this relationship for speech processing applications. In this thesis, we hypothesize that irregular phonation occurs naturally in situations with large amounts of change in pitch or power. We therefore focus on investigating parameters such as pitch variance and power variance as well as other measurable properties involving speech dynamics. In this work, we have investigated the frequency and structure of irregular phonation, the acoustic characteristics of the TIMIT Acoustic-Phonetic Speech Corpus, and relationships between these two groups. We show that characteristics of irregular phonation are positively correlated with several of our potential predictors including pitch and power variance. Finally, we demonstrate that these correlations lead to a model with the potential to predict the occurrence and properties of irregular phonation.
by Kimberly F. Dietz. M.Eng
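The kind of analysis described in this abstract, relating pitch or power variance to characteristics of irregular phonation, can be illustrated with a small sketch. It assumes per-utterance pitch tracks and some per-utterance measure of irregular phonation are already available. The helper names, the use of population variance, and the choice of the Pearson coefficient are assumptions for illustration, not details taken from the thesis.

```python
# Sketch only: variance of a feature track and Pearson correlation between
# two per-utterance measurements.
import math


def variance(values):
    """Population variance of a sequence of numbers (e.g. F0 samples in Hz)."""
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)
```

A typical use would compute `variance` on each utterance's pitch track and then pass the resulting list, together with a per-utterance irregular-phonation measure, to `pearson`; a value near +1 would indicate the positive relationship the abstract reports.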
- …