11 research outputs found

EAT, DRINK...

Author: Kirkland Ambika D.
Publication venue: The Cupola: Scholarship at Gettysburg College
Publication date: 01/01/2006

Gettysburg College

The Shaman

Author: Kirkland Ambika D.
Publication venue: The Cupola: Scholarship at Gettysburg College
Publication date: 01/01/2006

Gettysburg College

OverFlow: Putting flows on top of neural transducers for better TTS

Authors: Beskow Jonas; Henter Gustav Eje; Kirkland Ambika; Lameris Harm; Mehta Shivam; Székely Éva
Publication date: 29/05/2023

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.

Comment: 5 pages, 2 figures. Accepted for publication at Interspeech 202

arXiv.org e-Print Archive

Enhanced response to music in pregnancy

Authors: Ciupek Marian; Fritz Thomas; Guha Anika; Hoyer Jana; Ihme Klas; Kirkland Ambika; Villringer Arno
Publication venue: Wiley
Publication date: 01/01/2014

Given a possible effect of estrogen on the pleasure-mediating dopaminergic system, musical appreciation in participants whose estrogen levels are naturally elevated during the oral contraceptive cycle and pregnancy has been investigated (n = 32, 15 pregnant, 17 nonpregnant; mean age 27.2). Results show more pronounced blood pressure responses to music in pregnant women. However, estrogen level differences during different phases of oral contraceptive intake did not have any effect, indicating that the observed changes were not related to estrogen. Effects of music on blood pressure were independent of valence, and dissonance elicited the greatest drop in blood pressure. Thus, the enhanced physiological response in pregnant women probably does not reflect a protective mechanism to avoid unpleasantness. Instead, this enhanced response is discussed in terms of a facilitation of prenatal conditioning to acoustical (musical) stimuli.

Crossref

Ghent University Academic Bibliography

MPG.PuRe

En utforskning av neurala aspekter av turtagning i spontant samtal (An exploration of neural aspects of turn-taking in spontaneous conversation)

Author: Kirkland Ambika
Publication venue: Stockholms universitet, Institutionen för lingvistik
Publication date: 01/01/2020

This project added to the sparse body of research on the neural underpinnings of turn-taking with an electroencephalography (EEG) investigation of spontaneous conversation. Eighteen participants (3 male, 15 female, mean age 29.79), recruited and participating in pairs, underwent EEG hyperscanning as they conversed on a freely chosen topic for 45 minutes. In line with previous research, it was predicted that a time-frequency analysis of the EEG might reveal either increased power at around 10 Hz (the location of one of two components of the mu rhythm, an oscillation possibly involved in motor preparation for speech), or reduced alpha (8-12 Hz) power (reflecting non-motor aspects of turn preparation) prior to taking one’s turn. Increased power between 8-12 Hz was observed around 1.5 and 1 second preceding turn-taking, but similar power increases also occurred prior to turn-yielding and the conversation partner continuing after a pause, and a reduction in alpha power was found in turn-taking relative to listening to the other speaker continue after a pause. It is unclear whether this activity reflected motor or non-motor aspects of turn preparation, but the spontaneous conversation paradigm proved feasible for investigating brain activity coupled to turn-taking despite the methodological obstacles.

Swedish abstract (translated): This research project contributes to a topic on which relatively few studies have been conducted, with an electroencephalography (EEG) investigation of brain activity linked to turn-taking in spontaneous conversation. Eighteen participants (3 men, 15 women, mean age 29.79), recruited and participating in pairs, underwent EEG hyperscanning while they talked about a freely chosen topic for 45 minutes. It was predicted that a time-frequency analysis of the EEG might reveal either increased power at around 10 Hz (corresponding to one of two components of the mu rhythm, an oscillation possibly involved in motor preparation for speech) or reduced alpha power (8-12 Hz) (possibly reflecting non-motor aspects of turn preparation) before taking one's turn. Increased power between 8-12 Hz was observed approximately 1.5 and 1 second before turn-taking, but similar increases also occurred before the conversation partner took their turn or continued after a pause, and a decrease in alpha power was observed when turn-taking was compared with contexts in which participants listened as the other speaker continued after a pause. It is unclear whether this activity reflected motor or non-motor aspects of turn preparation, but it proved possible to investigate brain activity linked to spontaneous conversation in a reasonable way despite the paradigm's methodological difficulties.

Hidden events in turn-takin

Publikationer från Stockholms universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Two Pragmatic Functions of Breathy Voice in American English Conversation

Authors: Kirkland Ambika; Székely Éva; Ward Nigel; Włodarczak Marcin
Publication venue: International Speech Communication Association
Publication date: 01/01/2022

Although the paralinguistic and phonological significance of breathy voice is well known, its pragmatic roles have been little studied. We report a systematic exploration of the pragmatic functions of breathy voice in American English, using a small corpus of casual conversations, using the Cepstral Peak Prominence Smoothed measure as an indicator of breathy voice, and using a common workflow to find prosodic constructions and identify their meanings. We found two prosodic constructions involving breathy voice. The first involves a short region of breathy voice in the midst of a region of low pitch, functioning to mark self-directed speech. The second involves breathy voice over several seconds, combined with a moment of wider pitch range leading to a high pitch over about a second, functioning to mark an attempt to establish common ground. These interpretations were confirmed by a perception experiment.

QC 20220628

Perception of speaker stance – using spontaneous speech synthesis to explore the contribution of prosody, context and speaker (VR-2020-02396); Prosodic functions of voice quality dynamics (VR-2019-02932); CAPTivating – Comparative Analysis of Public speaking with Text-to-speech (P20-0298

Publikationer från KTH

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Perception of smiling voice in spontaneous speech synthesis

Authors: Gustafson Joakim; Kirkland Ambika; Székely Éva; Włodarczak Marcin
Publication venue: International Speech Communication Association
Publication date: 01/01/2021

Smiling during speech production has been shown to result in perceptible acoustic differences compared to non-smiling speech. However, there is a scarcity of research on the perception of “smiling voice” in synthesized spontaneous speech. In this study, we used a sequence-to-sequence neural text-to-speech system built on conversational data to produce utterances with the characteristics of spontaneous speech. Segments of speech following laughter, and the same utterances not preceded by laughter, were compared in a perceptual experiment after removing laughter and/or breaths from the beginning of the utterance to determine whether participants perceive the utterances preceded by laughter as sounding as if they were produced while smiling. The results showed that participants identified the post-laughter speech as smiling at a rate significantly greater than chance. Furthermore, the effect of content (positive/neutral/negative) was investigated. These results show that laughter, a spontaneous, non-elicited phenomenon in our model’s training data, can be used to synthesize expressive speech with the perceptual characteristics of smiling.

QC 20230616

Publikationer från KTH

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Evaluating the impact of disfluencies on the perception of speaker competence using neural speech synthesis

Authors: Gustafson Joakim; Kirkland Ambika; Székely Éva; Włodarczak Marcin
Publication venue: KTH Royal Institute of Technology
Publication date: 01/01/2023

CAPTivating – Comparative Analysis of Public Speaking with Text-to-Speech (P20-0298

Publikationer från Stockholms universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge

Authors: Beskow Jonas; Gustafson Joakim; Kirkland Ambika; Lameris Harm; Mehta Shivam; Moell Birger; O'Regan Jim
Publication venue: Marseille, France
Publication date: 01/01/2022

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually-transcribed speech from non-aphasic speakers (TIMIT) improves performance when Room Impulse Response is used to augment the data. The best performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.

QC 20220815

Publikationer från KTH

Digitala Vetenskapliga Arkivet - Academic Archive On-line