OverFlow: Putting flows on top of neural transducers for better TTS
Neural HMMs are a type of neural transducer recently proposed for
sequence-to-sequence modelling in text-to-speech. They combine the best
features of classic statistical speech synthesis and modern neural TTS,
requiring less data and fewer training updates, and are less prone to gibberish
output caused by neural attention failures. In this paper, we combine neural
HMM TTS with normalising flows for describing the highly non-Gaussian
distribution of speech acoustics. The result is a powerful, fully probabilistic
model of durations and acoustics that can be trained using exact maximum
likelihood. Experiments show that a system based on our proposal needs fewer
updates than comparable methods to produce accurate pronunciations and a
subjective speech quality close to natural speech. Please see
https://shivammehta25.github.io/OverFlow/ for audio examples and code.
Comment: 5 pages, 2 figures. Accepted for publication at Interspeech 2023.
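The exact-likelihood training that flows make possible is worth unpacking: because each flow layer is invertible, the log-density of an observed acoustic frame is the base-distribution log-density of its latent image plus the accumulated log-determinants of the layer Jacobians. Below is a minimal PyTorch sketch of this idea using a generic Glow-style affine coupling layer; the layer design, dimensions, and names are illustrative assumptions, not the OverFlow architecture itself.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible layer: the first half of the dimensions
    parameterises an affine transform of the second half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                # bounded scales for stability
        z = torch.cat([xa, xb * log_s.exp() + t], dim=-1)
        return z, log_s.sum(dim=-1)              # log|det J| of this layer

def flow_log_likelihood(x, layers):
    """Exact log p(x): base log-density plus accumulated log-determinants."""
    z, log_det = x, x.new_zeros(x.shape[0])
    for layer in layers:
        z, ld = layer(z)
        log_det = log_det + ld
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum(dim=-1) + log_det

# Training simply minimises the exact negative log-likelihood:
layers = nn.ModuleList([AffineCoupling(dim=80) for _ in range(4)])
frames = torch.randn(8, 80)                      # stand-in for mel-spectrogram frames
loss = -flow_log_likelihood(frames, layers).mean()
loss.backward()
```

No variational bound or adversarial objective is needed: the change-of-variables formula gives the likelihood in closed form, which is what lets such models be trained by exact maximum likelihood.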
Enhanced response to music in pregnancy
Given a possible effect of estrogen on the pleasure-mediating dopaminergic system, musical appreciation was investigated in participants whose estrogen levels were naturally elevated during the oral contraceptive cycle or during pregnancy (n = 32: 15 pregnant, 17 non-pregnant; mean age 27.2). Results show more pronounced blood pressure responses to music in pregnant women. However, differences in estrogen level across phases of oral contraceptive intake had no effect, indicating that the observed changes were not related to estrogen. Effects of music on blood pressure were independent of valence, and dissonance elicited the greatest drop in blood pressure. Thus, the enhanced physiological response in pregnant women probably does not reflect a protective mechanism to avoid unpleasantness. Instead, this enhanced response is discussed in terms of a facilitation of prenatal conditioning to acoustical (musical) stimuli.
An exploration of neural aspects of turn-taking in spontaneous conversation
This project added to the sparse body of research on the neural underpinnings of turn-taking with an electroencephalography (EEG) investigation of spontaneous conversation. Eighteen participants (3 male, 15 female; mean age 29.79), recruited and participating in pairs, underwent EEG hyperscanning while conversing on a freely chosen topic for 45 minutes. In line with previous research, it was predicted that a time-frequency analysis of the EEG might reveal either increased power at around 10 Hz (the location of one of two components of the mu rhythm, an oscillation possibly involved in motor preparation for speech), or reduced alpha (8-12 Hz) power (reflecting non-motor aspects of turn preparation), prior to taking one's turn. Increased power between 8 and 12 Hz was observed around 1.5 and 1 second before turn-taking, but similar power increases also occurred prior to turn-yielding and to the conversation partner continuing after a pause, and a reduction in alpha power was found in turn-taking relative to listening to the other speaker continue after a pause. It is unclear whether this activity reflected motor or non-motor aspects of turn preparation, but the spontaneous conversation paradigm proved feasible for investigating brain activity coupled to turn-taking despite the methodological obstacles.
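For readers unfamiliar with this kind of analysis, the core quantity is band-limited spectral power in a window preceding an event of interest. Below is a minimal sketch assuming scipy and a single pre-cut EEG channel; the sampling rate, signal, and event times are placeholders, and a real pipeline would use epoched, artifact-rejected multi-channel data (e.g. via MNE-Python).

```python
import numpy as np
from scipy.signal import spectrogram

def alpha_power_before_event(eeg, fs, event_sample, window_s=0.5):
    """Mean 8-12 Hz power in a window ending at an event (e.g. a turn start).

    eeg: 1-D signal from one EEG channel.
    fs: sampling rate in Hz.
    event_sample: sample index of the turn transition.
    """
    start = event_sample - int(window_s * fs)
    segment = eeg[start:event_sample]
    freqs, _, sxx = spectrogram(segment, fs=fs, nperseg=min(len(segment), 256))
    alpha = (freqs >= 8) & (freqs <= 12)
    return sxx[alpha].mean()

# Illustrative comparison of pre-turn-taking windows against a baseline
fs = 500                                    # assumed sampling rate
eeg = np.random.randn(fs * 60)              # stand-in for one recorded channel
turn_starts = [fs * 10, fs * 25, fs * 40]   # assumed event sample indices
powers = [alpha_power_before_event(eeg, fs, s) for s in turn_starts]
print(np.mean(powers))
```

Contrasting this quantity across conditions (taking a turn, yielding a turn, listening through a pause) is what distinguishes the motor and non-motor interpretations discussed in the abstract.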
Two Pragmatic Functions of Breathy Voice in American English Conversation
Although the paralinguistic and phonological significance of breathy voice is well known, its pragmatic roles have been little studied. We report a systematic exploration of the pragmatic functions of breathy voice in American English, using a small corpus of casual conversations, the Cepstral Peak Prominence Smoothed measure as an indicator of breathy voice, and a common workflow to find prosodic constructions and identify their meanings. We found two prosodic constructions involving breathy voice. The first involves a short region of breathy voice in the midst of a region of low pitch, functioning to mark self-directed speech. The second involves breathy voice over several seconds, combined with a moment of wider pitch range leading to a high pitch over about a second, functioning to mark an attempt to establish common ground. These interpretations were confirmed by a perception experiment.
Funding: Perception of speaker stance – using spontaneous speech synthesis to explore the contribution of prosody, context and speaker (VR-2020-02396); Prosodic functions of voice quality dynamics (VR-2019-02932); CAPTivating – Comparative Analysis of Public speaking with Text-to-speech (P20-0298).
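Cepstral peak prominence is the height of the cepstral peak, searched in the plausible-F0 quefrency range, above a regression line fitted to the cepstrum; lower values are commonly read as breathier phonation. A simplified single-frame numpy sketch follows; the smoothed variant (CPPS) used in the study additionally averages cepstra over time and quefrency, and parameter choices differ across published implementations, so treat this as illustrative rather than the study's measure.

```python
import numpy as np

def cepstral_peak_prominence(frame, fs, f0_min=60.0, f0_max=300.0):
    """CPP (dB) of one speech frame: cepstral peak height above a
    regression line, searched in the plausible-F0 quefrency range."""
    windowed = frame * np.hanning(len(frame))
    log_spec = 20 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.abs(np.fft.irfft(log_spec))     # quefrency domain
    quefrency = np.arange(len(cepstrum)) / fs     # in seconds

    lo, hi = int(fs / f0_max), int(fs / f0_min)   # F0 search band
    peak = lo + np.argmax(cepstrum[lo:hi])

    slope, intercept = np.polyfit(quefrency[1:hi], cepstrum[1:hi], 1)
    return cepstrum[peak] - (slope * quefrency[peak] + intercept)

# Illustrative call on a 64 ms frame sampled at 16 kHz
fs = 16000
frame = np.random.randn(1024)                     # stand-in for real speech
print(cepstral_peak_prominence(frame, fs))
```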
Perception of smiling voice in spontaneous speech synthesis
Smiling during speech production has been shown to result in perceptible acoustic differences compared to non-smiling speech. However, there is a scarcity of research on the perception of "smiling voice" in synthesized spontaneous speech. In this study, we used a sequence-to-sequence neural text-to-speech system built on conversational data to produce utterances with the characteristics of spontaneous speech. Segments of speech following laughter, and the same utterances not preceded by laughter, were compared in a perceptual experiment after removing laughter and/or breaths from the beginning of the utterance, to determine whether participants perceive the utterances preceded by laughter as sounding as if they were produced while smiling. The results showed that participants identified the post-laughter speech as smiling at a rate significantly greater than chance. Furthermore, the effect of content (positive/neutral/negative) was investigated. These results show that laughter, a spontaneous, non-elicited phenomenon in our model's training data, can be used to synthesize expressive speech with the perceptual characteristics of smiling.
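"Significantly greater than chance" in a two-alternative judgement task reduces to a one-sided binomial test against p = 0.5. A minimal sketch assuming scipy; the counts are hypothetical, as the actual trial numbers are not given here.

```python
from scipy.stats import binomtest

# Hypothetical counts: 230 of 360 post-laughter stimuli judged "smiling"
result = binomtest(k=230, n=360, p=0.5, alternative="greater")
print(result.pvalue)   # well below 0.05 for these illustrative counts
```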
Evaluating the impact of disfluencies on the perception of speaker competence using neural speech synthesis
Funding: CAPTivating – Comparative Analysis of Public Speaking with Text-to-Speech (P20-0298).
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shifting, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when room impulse response augmentation is used. The best-performing model combines aphasic and non-aphasic data and achieves a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% over the baseline model on the primary outcome measure. We show that data augmentation, larger model size, and additional non-aphasic data sources can help improve automatic phoneme recognition models for people with aphasia.
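PER, the primary metric here, is a normalised Levenshtein (edit) distance over phoneme sequences: total insertions, deletions, and substitutions divided by the number of reference phonemes. A self-contained sketch follows; FER, which scores distinctive features rather than whole phonemes, requires a feature inventory and is omitted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def phoneme_error_rate(refs, hyps):
    """PER = total edit distance / total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

# Illustrative pair of ARPAbet-like phoneme sequences: one deletion in four
print(phoneme_error_rate([["HH", "AH", "L", "OW"]], [["HH", "L", "OW"]]))  # 0.25
```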