448 research outputs found

Improved automatic detection of creak

Author: Blomgren
Breiman
Böhm
Campbell
Carlson
Christer Gobl
Degottex
Deshmukh
Drugman
Edlund
Edmondson
Elliot
Espy-Wilson
Gerratt
Ghosh
Gobl
Heldner
Hollien
Ishi
John Kane
Kane
Kay
Laver
Magnuson
Moisik
Ogden
Scherer
Silen
Slifka
Surana
Thomas Drugman
Titze
Varga
Villavicencio
Vishnubhotla
Wolk
Yanushevskaya
Yu
Yuasa
Publication venue: 'Elsevier BV'

Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

Author: Andreas Stolcke
Bahl
Baum
Breiman
Brown
Bruce
Buntine
Dermatas
Dilek Hakkani-Tür
Elizabeth Shriberg
Gökhan Tür
Hearst
Katz
Palmer
Shriberg
Sluijter
Swerts
Thorsen
Viterbi
Publication date: 01/01/2000

A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
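The abstract mentions a probabilistic combination of prosodic and lexical information but does not say how the two are combined. As a rough illustration only, the sketch below uses a generic log-linear interpolation of two per-word boundary posteriors. The function names, the `weight` and `threshold` parameters, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def combine_posteriors(p_prosody, p_lexical, weight=0.5):
    """Log-linear interpolation of two boundary posteriors.

    Assumption: this is a generic combination scheme chosen for
    illustration, not the method used in the paper. The result is
    renormalized over the two outcomes (boundary, no boundary).
    """
    eps = 1e-12
    pp = min(max(p_prosody, eps), 1.0 - eps)
    pl = min(max(p_lexical, eps), 1.0 - eps)
    log_yes = weight * math.log(pp) + (1.0 - weight) * math.log(pl)
    log_no = weight * math.log(1.0 - pp) + (1.0 - weight) * math.log(1.0 - pl)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_yes, log_no)
    yes = math.exp(log_yes - m)
    no = math.exp(log_no - m)
    return yes / (yes + no)


def segment(words, prosody_probs, lexical_probs, weight=0.5, threshold=0.5):
    """Split a word sequence wherever the combined posterior of a
    boundary after the word reaches the (hypothetical) threshold."""
    sentences, current = [], []
    for word, pp, pl in zip(words, prosody_probs, lexical_probs):
        current.append(word)
        if combine_posteriors(pp, pl, weight) >= threshold:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Made-up posteriors for a boundary after each word.
    words = ["hello", "there", "how", "are"]
    print(segment(words, [0.1, 0.9, 0.1, 0.2], [0.2, 0.8, 0.1, 0.1]))
```

In a scheme like this, `weight` would typically be tuned on held-out data so that the more reliable knowledge source for a given corpus has more influence.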

arXiv.org e-Print Archive

CiteSeerX

Crossref

Bilkent University Institutional Repository

The Perception of Creaky Voice: Does Speaker Gender Affect our Judgments?

Author: Lee Kaitlyn E.
Publication venue: UKnowledge
Publication date: 01/01/2016

This study examines the phonetics of creaky voice saliency and the perceptual sociolinguistic indexes evoked by creaky voice use. It consists of two experiments: the first a listener-judgment task using a Likert scale, the second an AXB task. The first experiment used modal and creaky voice statement-of-fact tokens to determine whether listeners judged the speaker to have or lack a given characteristic (intelligent, feminine, educated, masculine, hesitant, and confident). Both male and female speakers were judged less intelligent, less educated, less feminine, more masculine, less confident, and more hesitant when using creaky voice phonation than when using the modal register. Participants also rated male and female speakers as statistically different. In the second experiment, participants listened to continua that ranged from modal register to extreme creaky voice (based on F0 levels) and performed an AXB task to determine how well they could distinguish levels of creaky voice along the continuum. Participants were less able to correctly detect the level of creaky voice in the female speaker than in the male speaker for the lower half of the continuum.

University of Kentucky

DeepFry: Identifying Vocal Fry Using Deep Neural Networks

Author: Chernyak Bronya R.
Chodroff Eleanor
Cole Jennifer S.
Keshet Joseph
Segal Yael
Simon Talia Ben
Steffman Jeremy
Publication date: 31/03/2022

Vocal fry, or creaky voice, refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also to signal sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully-connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one is tailored for the task, and the other is based on a state-of-the-art unsupervised representation. Results suggest our best-performing system has improved recall and F1 scores compared to previous methods on unseen data.
Comment: under submission to Interspeech 202

arXiv.org e-Print Archive

HMM-based synthesis of creaky voice

Author: Drugman Thomas
Gobl Christer
Kane John
Raitio Tuomo
Publication date: 01/01/2013

Creaky voice, also referred to as vocal fry, is a voice quality frequently produced in many languages, in both read and conversational speech. To enhance the naturalness of speech synthesis, synthesis systems should be able to generate speech in all its expressive diversity, including creaky voice. The present study aims to integrate our recent developments, including creaky voice detection, prediction of creaky voice from context, and rendering of the creaky excitation, into a fully functioning and automatic HMM-based synthesis system. HMM-based synthetic creaky voices are built and evaluated in subjective listening tests, which show that the best synthetic creaky voices are rated more natural and more creaky than a conventional voice. A non-creaky voice is also successfully transformed to use creak by modifying the F0 contour and excitation of the predicted creaky parts. The transformed voice is rated equal in naturalness to the original voice and clearly more creaky.
Index Terms: speech synthesis, creaky voice, contextual factors, F0 estimation, excitation modeling

CiteSeerX

Edinburgh Research Explorer

Automated measures of dysphonias and the phonatory effects of asymmetries in the posterior larynx

Author: Vieira Maurilio Nunes
Publication venue: The University of Edinburgh
Publication date: 01/01/1997

Edinburgh Research Archive

Acoustic and linguistic interdependencies of irregular phonation

Author: Dietz Kimberly F
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2010

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 57-58).
Irregular phonation is a commonly occurring but only partially understood phenomenon of human speech production. We know properties of irregular phonation can be clues to a speaker's dialect and even identity. We also have evidence that irregular phonation is used as a signal of linguistic and acoustic intent. Nonetheless, there remain fundamental questions about the nature of irregular phonation and the interdependencies of irregular phonation with acoustic and linguistic speech characteristics, as well as the implications of this relationship for speech processing applications. In this thesis, we hypothesize that irregular phonation occurs naturally in situations with large amounts of change in pitch or power. We therefore focus on investigating parameters such as pitch variance and power variance as well as other measurable properties involving speech dynamics. In this work, we have investigated the frequency and structure of irregular phonation, the acoustic characteristics of the TIMIT Acoustic-Phonetic Speech Corpus, and relationships between these two groups. We show that characteristics of irregular phonation are positively correlated with several of our potential predictors including pitch and power variance. Finally, we demonstrate that these correlations lead to a model with the potential to predict the occurrence and properties of irregular phonation.
by Kimberly F. Dietz. M.Eng
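The abstract reports positive correlations between characteristics of irregular phonation and predictors such as pitch and power variance, without naming the statistic used. As an illustration only, the sketch below computes a Pearson correlation coefficient between two per-utterance measurements. The function name and the idea of pairing a pitch-variance value with an irregular-phonation rate per utterance are assumptions made for this example, not details from the thesis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: Pearson's r is used here as a generic correlation
    measure; the abstract does not specify which statistic the thesis uses.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-utterance values, invented for illustration:
    # pitch variance and fraction of frames with irregular phonation.
    pitch_variance = [1.0, 2.0, 3.0, 4.0]
    irregular_rate = [0.05, 0.10, 0.12, 0.20]
    print(pearson_r(pitch_variance, irregular_rate))
```

A value near +1 would indicate that utterances with more pitch variation tend to contain more irregular phonation, which is the direction of effect the abstract describes.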

DSpace@MIT