DeepFry: Identifying Vocal Fry Using Deep Neural Networks
Vocal fry or creaky voice refers to a voice quality characterized by
irregular glottal opening and low pitch. It occurs in diverse languages and is
prevalent in American English, where it is used not only to mark phrase
finality but also to signal sociolinguistic factors and affect. Due to its irregular
periodicity, creaky voice challenges automatic speech processing and
recognition systems, particularly for languages where creak is frequently used.
This paper proposes a deep learning model to detect creaky voice in fluent
speech. The model is composed of an encoder and a classifier trained together.
The encoder takes the raw waveform and learns a representation using a
convolutional neural network. The classifier is implemented as a multi-headed
fully-connected network trained to detect creaky voice, voicing, and pitch,
where the last two are used to refine creak prediction. The model is trained
and tested on speech of American English speakers, annotated for creak by
trained phoneticians.
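
The encoder-plus-heads layout described above can be illustrated with a minimal, untrained sketch in plain Python. Everything concrete here is an assumption made for illustration and does not come from the paper: the class name `CreakSketch`, the kernel count, width, and stride, the random weights, and the choice to treat all three heads as per-frame probabilities. The sketch stops at the three head outputs, because the abstract does not say how voicing and pitch are used to refine the creak prediction.

```python
import math
import random


def conv1d_relu(signal, kernels, stride):
    """Slide each kernel over the raw waveform; return one ReLU feature vector per frame."""
    width = len(kernels[0])
    frames = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        frames.append(
            [max(0.0, sum(w * x for w, x in zip(kernel, window))) for kernel in kernels]
        )
    return frames


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class CreakSketch:
    """Hypothetical toy model: a shared convolutional encoder feeding three heads."""

    HEADS = ("creak", "voiced", "pitch")

    def __init__(self, n_kernels=4, width=8, stride=4, seed=0):
        rng = random.Random(seed)
        self.stride = stride
        # Encoder weights (random here; in a real system they would be learned).
        self.kernels = [
            [rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(n_kernels)
        ]
        # One linear head per task, all reading the same encoder features.
        self.heads = {
            name: [rng.uniform(-0.5, 0.5) for _ in range(n_kernels)]
            for name in self.HEADS
        }

    def forward(self, waveform):
        """Return, for every frame, a dict with one probability per head."""
        outputs = []
        for features in conv1d_relu(waveform, self.kernels, self.stride):
            outputs.append(
                {
                    name: sigmoid(sum(w * f for w, f in zip(weights, features)))
                    for name, weights in self.heads.items()
                }
            )
        return outputs


if __name__ == "__main__":
    wave = [math.sin(0.3 * t) for t in range(40)]  # synthetic stand-in for speech
    for frame in CreakSketch().forward(wave)[:3]:
        print(frame)
```

Because the heads share the encoder features, training them together lets the voicing and pitch tasks shape the representation that the creak head also reads, which is the general idea behind a multi-headed classifier.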
We evaluated the performance of our system using two encoders: one is
tailored for the task, and the other is based on a state-of-the-art
unsupervised representation. Results suggest our best-performing system has
improved recall and F1 scores compared to previous methods on unseen data.
Comment: under submission to Interspeech 202
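
For readers unfamiliar with the metrics named in the abstract, the following sketch shows how recall and F1 are computed from binary labels. It assumes frame-level labels where 1 marks creak and 0 marks no creak; the abstract does not state the evaluation granularity, and the function name and the example labels are made up for illustration.

```python
def precision_recall_f1(reference, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = creak)."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)        # creak correctly detected
    fp = sum(1 for r, p in pairs if not r and p)    # false alarm
    fn = sum(1 for r, p in pairs if r and not p)    # missed creak
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    reference = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical annotator labels
    predicted = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical system output
    print(precision_recall_f1(reference, predicted))
```

Recall measures the share of annotated creak that the system finds, while F1 balances recall against precision, so a detector cannot score well on F1 simply by labelling every frame as creaky.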