    No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation

    Automatic speech recognition (ASR) systems are known to be sensitive to the sociolinguistic variability of speech data, in which gender plays a crucial role. This can result in disparities in recognition accuracy between male and female speakers, primarily due to the under-representation of the latter group in the training data. While several solutions have been proposed in the context of hybrid ASR models, the gender bias issue has not been explicitly addressed in end-to-end neural architectures. To fill this gap, we propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants. This technique reduces the data imbalance between genders by simulating voices of the under-represented female speakers and increases the variability within each gender group. Experiments on spontaneous English speech show that our technique yields a relative WER improvement of up to 9.87% for utterances by female speakers, with larger gains for the least-represented f0 ranges. Comment: Accepted at ASRU 2023.
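
    As a rough illustration of the kind of f0 and formant manipulation the abstract describes (the paper's own implementation is not detailed here), the sketch below uses the WORLD vocoder, via the pyworld package, to raise the fundamental frequency and warp the spectral envelope of an utterance toward typical female ranges. The helper name, file handling, and scaling ratios are illustrative assumptions.

        import numpy as np
        import pyworld
        import soundfile as sf

        def simulate_female_voice(in_path, out_path, f0_ratio=1.6, formant_ratio=1.15):
            # Hypothetical helper; both ratios are illustrative, not values from the paper.
            x, fs = sf.read(in_path)
            if x.ndim > 1:                      # assume a mono signal is wanted
                x = x.mean(axis=1)
            x = np.ascontiguousarray(x, dtype=np.float64)
            # Decompose into f0 contour, spectral envelope, and aperiodicity.
            f0, sp, ap = pyworld.wav2world(x, fs)
            f0_mod = f0 * f0_ratio              # raise the fundamental frequency
            # Warp the spectral envelope along the frequency axis to raise the formants.
            bins = np.arange(sp.shape[1])
            warped = bins / formant_ratio
            sp_mod = np.array([np.interp(warped, bins, frame) for frame in sp])
            ap_mod = np.array([np.interp(warped, bins, frame) for frame in ap])
            y = pyworld.synthesize(f0_mod, sp_mod, ap_mod, fs)
            sf.write(out_path, y, fs)

    Applying such a transform to utterances from the over-represented gender group would yield additional pseudo-female training data, which is the augmentation idea the abstract outlines.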

    An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

    Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions, aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, as the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive picture of this exciting field. Comment: Submitted to the Proceedings of the IEEE.

    The Good, The Bad, and The Funny: A Neurocognitive Study of Laughter as a Meaningful Socioemotional Cue

    Laughter is a socioemotional cue that is characteristically positive and historically served to facilitate social bonding. Like other communicative gestures (e.g., facial expressions, groans, sighs), however, the interpretation of laughter is no longer bound to a particular affective state. Thus, an important question is how basic psychological mechanisms, such as early sensory arousal, emotion evaluation, and meaning representation, contribute to the interpretation of laughter in different contexts. A related question is how brain dynamic processes reflect these different aspects of laughter comprehension. The present study addressed these questions using event-related potentials (ERPs) to examine laughter comprehension within a cross-modal priming paradigm. Target stimuli were visually presented words, which were preceded by either laughs or environmental sounds (500 ms versions of the International Affective Digitized Sounds, IADS). The study addressed four questions: (1) Does emotion priming lead to N400 effects? (2) Do positive and negative sounds elicit different neurocognitive responses? (3) Are there laughter-specific ERPs? (4) Can laughter priming of good and bad concepts be reversed under social anxiety? Four experiments were conducted. In all four experiments, participants were asked to make speeded judgments about the valence of the target words. Experiments 1–3 examined behavioral effects of emotion priming using variations on this paradigm. In Experiment 4, participants performed the task while their electroencephalographic (EEG) data were recorded. After six experimental blocks, a mood manipulation was administered to activate negative responses to laughter; the task was then repeated. Accuracy and reaction time showed a small but significant priming effect across studies. Surprisingly, N400 effects of emotion priming were absent. Instead, there was a later (~400–600 ms) effect over orbitofrontal electrodes (the orbitofrontal priming effect, OPE). Valence-specific effects were observed in the early posterior negativity (EPN, ~275 ms) and in the late positive potential (LPP, ~600 ms). Laughter-specific effects were observed over orbitofrontal sites beginning approximately 200 ms after target onset. Finally, the OPE was observed for laughs both before and after the mood manipulation; contrary to the hypothesis, the direction of priming did not reverse. Interestingly, the OPE was observed for IADS only prior to the mood manipulation, providing some evidence for laughter-specific effects in emotion priming. These findings call into question the N400 as a marker of emotion priming and contribute to the understanding of the neurocognitive stages of laughter perception. More generally, they add to the growing literature on the neurophysiology of emotion and emotion representation.
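
    For readers unfamiliar with how latency-window effects such as the OPE are quantified, the sketch below shows one conventional way to compute mean ERP amplitudes per priming condition with MNE-Python. The file name, trigger codes, electrode labels, and exact window are assumptions for illustration, not the study's actual analysis pipeline.

        import mne

        # Hypothetical recording and trigger codes, purely for illustration.
        raw = mne.io.read_raw_fif("sub01_raw.fif", preload=True)
        events = mne.find_events(raw)                    # assumes a stimulus trigger channel
        event_id = {"congruent": 1, "incongruent": 2}    # prime-target valence match vs. mismatch

        # Epoch around target-word onset with a 200 ms pre-stimulus baseline.
        epochs = mne.Epochs(raw, events, event_id=event_id, tmin=-0.2, tmax=0.8,
                            baseline=(None, 0), preload=True)

        frontal = ["Fp1", "Fp2", "AF7", "AF8"]           # stand-ins for orbitofrontal sites
        for condition in event_id:
            evoked = epochs[condition].average()
            # Mean amplitude over the chosen channels in a ~400-600 ms window (cf. the OPE).
            window = evoked.copy().pick(frontal).crop(0.400, 0.600).data
            print(condition, f"{window.mean() * 1e6:.2f} uV")

    The priming effect would then be the difference between the congruent and incongruent means, typically tested across participants with a repeated-measures analysis.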

    Voices Within Voices: Developing a New Analytical Approach to Vocal Timbre by Examining the Interplay of Emotionally Valenced Vocal Timbres and Emotionally Valenced Lyrics

    Aims/Goals: This thesis presents a new analytical technique for vocal timbre, based on the hypothesis that emotion expressed in vocal timbre impacts emotional perception of lyrics.
    Background: Vocal timbre is a highly salient musical feature that arguably contributes significantly to our emotional experience of a song. Despite this, analytical techniques for vocal timbre remain in their infancy. Today this is changing, as technological developments increasingly allow vocal timbre to be preserved and studied in a systematic way. The present research capitalises on these developments, using them to examine how emotionally valenced vocal timbres impact emotional perception of lyrics.
    Methodology: Since there is little empirical research on the hypothesis underlying this analytical technique, and since the experience of vocal timbre could be considered highly subjective, it was first necessary to test experimentally whether and how vocal timbre impacts lyric perception. To this end, a reception test examined whether vocal timbre on its own has emotional valence, and whether this valence is salient enough to impact emotional perception of words. Results from this test supported the hypothesis, showing that participants were significantly less accurate at identifying the emotional valence of words when those words were sung with a mismatched emotional vocal timbre. The analytical technique itself is multilayered. First, the recording is taken as the basis of analysis. Then the vocal timbre is described, and its emotional valence assessed, through Vocal Timbre Features (a system, inspired by the work of van Leeuwen (1999), defined and developed to aid in describing vocal timbre and, potentially, categorising its emotional valence). Observations made by aurally detecting and annotating the Vocal Timbre Features can be confirmed visually through spectrograms. The synergies between the emotions identified in the vocal timbre and those conveyed through the lyrics can then be assessed using adapted diagrammatic vocabulary sets (inspired by the work of Dennis Smalley (1986, 1997)).
    Conclusions: In summary, this thesis presents a new analytical technique for analysing vocal timbre in terms of its emotional meaning, and in terms of how this emotional meaning impacts emotional perception of lyrics. It also offers a framework for conducting efficient, aurally based analyses of vocal timbre more generally. The thesis has also shown that the experience of emotion in vocal timbre, and its impact on lyric perception, may be similar across listeners (i.e. intersubjective).
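
    Because the technique relies on confirming aurally annotated Vocal Timbre Features against spectrograms, a minimal sketch of how such a spectrogram might be produced with librosa and matplotlib follows. The file name and analysis parameters are illustrative assumptions rather than settings taken from the thesis.

        import numpy as np
        import librosa
        import librosa.display
        import matplotlib.pyplot as plt

        # Hypothetical isolated vocal excerpt.
        y, sr = librosa.load("vocal_excerpt.wav", sr=None)

        # Log-magnitude STFT spectrogram for inspecting timbral features.
        S = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=2048, hop_length=512)),
                                    ref=np.max)

        fig, ax = plt.subplots()
        img = librosa.display.specshow(S, sr=sr, hop_length=512,
                                       x_axis="time", y_axis="log", ax=ax)
        fig.colorbar(img, ax=ax, format="%+2.0f dB")
        ax.set_title("Spectrogram of the analysed vocal excerpt")
        plt.show()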

    Advances in the neurocognition of music and language

    The functional significance of cross-sensory correspondences in infant-directed speech

    Evidence suggesting that infants appreciate a range of cross-sensory correspondences is growing rapidly (see Dolscheid, Hunnius, Casasanto & Majid, 2014; Fernández-Prieto, Navarra & Pons, 2015; Haryu & Kajikawa, 2012; Mondloch & Maurer, 2004; Walker, Bremner, Mason, Spring, Mattock, Slater, & Johnson, 2010; Walker, Bremner, Lunghi, Dolscheid, Barba & Simion, 2018), and yet there has been no known attempt to establish the functional significance of these correspondences in infancy. Research shows that speakers manipulate their prosody (i.e. the melody of spoken language) to communicate the meaning of unfamiliar words, and do so in ways that exploit the cross-sensory correspondences between, for example, pitch and size (Nygaard, Herold & Namy, 2009) and pitch and height (Shintel, Nusbaum & Okrent, 2006). But do infants attend to a speaker’s prosody in this context to interpret the meaning of unfamiliar words? The aim of this thesis is to further establish how infant-directed speakers use prosody to communicate the cross-sensory meanings of words and, for the first time, to identify whether infants capitalise on their sensitivity to cross-sensory correspondences to resolve linguistic uncertainty. In Experiments 1–4 we identify a list of novel pseudowords to use in all of the experiments reported. These pseudowords were judged by participants as being neutral in terms of their sound-symbolic potential, allowing us to rule out the impact of sound symbolism in our investigation. Experiment 5 provides support for earlier studies revealing cross-sensory correspondences in infant-directed speech. When presented with pseudowords spoken in a prosodically meaningful way, 13-month-old infants demonstrated a preference for objects that contradicted the cross-sensory acoustic properties of the speech (e.g. a lower-pitch voice with higher objects) (Experiment 6), and adults failed to match pseudowords with objects based on the prosodic information provided (Experiment 7). However, Experiment 8 provides evidence that 24-month-olds match pseudowords spoken in a higher-pitch voice, and at a faster rate, with objects that are visually higher in space. The implications of these findings are discussed, with suggestions as to how they can be usefully extended.
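
    To make the prosodic variables in these experiments concrete, the sketch below shows one way the mean pitch and duration of a recorded pseudoword could be measured with librosa's probabilistic-YIN tracker. The file name and pitch-range bounds are illustrative assumptions, not the analysis used in the thesis.

        import numpy as np
        import librosa

        # Hypothetical recording of a pseudoword produced in infant-directed speech.
        y, sr = librosa.load("pseudoword_high.wav", sr=None)

        # Track the fundamental frequency (f0) contour; unvoiced frames come back as NaN.
        f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=600, sr=sr)

        mean_f0 = np.nanmean(f0)                     # average pitch over voiced frames
        duration = librosa.get_duration(y=y, sr=sr)  # longer tokens imply a slower speaking rate

        print(f"mean f0: {mean_f0:.1f} Hz, duration: {duration:.2f} s")

    Comparing such measures between, for example, higher-object and lower-object conditions is what would reveal whether speakers exploit the pitch-height correspondence.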