25 research outputs found
The paradoxical role of emotional intensity in the perception of vocal affect
Vocalizations such as laughter, cries, moans, or screams constitute a potent source of information about the affective states of others. It is typically conjectured that the higher the intensity of the expressed emotion, the better the classification of affective information. However, attempts to map the relation between affective intensity and inferred meaning are controversial. Based on a newly developed stimulus database of carefully validated non-speech expressions ranging across the entire intensity spectrum from low to peak, we show that this intuition is false. In three experiments (N = 90), we demonstrate that intensity in fact has a paradoxical role. Participants were asked to rate and classify the authenticity, intensity, and emotion, as well as the valence and arousal, of this wide range of vocalizations. Listeners are clearly able to infer expressed intensity and arousal; in contrast, and surprisingly, emotion category and valence have a perceptual sweet spot: moderate and strong emotions are clearly categorized, but peak emotions are maximally ambiguous. This finding, which converges with related observations from visual experiments, raises interesting theoretical challenges for the emotion communication literature
Perception of Nigerian Dùndún talking drum performances as speech-like vs. music-like: The role of familiarity and acoustic cues
It seems trivial to identify sound sequences as music or speech, particularly when the sequences come from different sound sources, such as an orchestra and a human voice. Can we also easily distinguish these categories when the sequences come from the same sound source, and on the basis of which acoustic features? We investigated these questions by examining listeners' classification of sound sequences performed on an instrument that intertwines speech and music: the dùndún talking drum. The dùndún is commonly used in south-west Nigeria as a musical instrument but is also well suited to linguistic use as one of the speech surrogates described in Africa. One hundred seven participants from diverse geographical locations (15 different mother tongues represented) took part in an online experiment. Fifty-one participants reported being familiar with the dùndún talking drum, 55% of those being speakers of Yorùbá. During the experiment, participants listened to 30 dùndún samples of about 7 s each, performed either as music or as Yorùbá speech surrogate (n = 15 each) by a professional musician, and were asked to classify each sample as music-like or speech-like. The classification task revealed the ability of listeners to identify the samples as intended by the performer, particularly when they were familiar with the dùndún, though even unfamiliar participants performed above chance. A logistic regression predicting participants' classification of the samples from several acoustic features confirmed the perceptual relevance of intensity, pitch, timbre, and timing measures and their interaction with listener familiarity. In all, this study provides empirical evidence supporting the discriminating role of acoustic features and the modulatory role of familiarity in teasing apart speech and music
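To make the modeling approach concrete, here is a minimal sketch, not the authors' analysis code, of a logistic regression of this general form fitted with statsmodels on simulated data; the feature names (intensity_var, pitch_height, spectral_flux, rate, familiar) are hypothetical stand-ins for the study's acoustic measures.

```python
# Hedged sketch: logistic regression of music vs. speech-surrogate judgments
# on acoustic features and their interaction with listener familiarity.
# All data below are simulated; column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "intensity_var": rng.normal(size=n),     # variability in loudness
    "pitch_height": rng.normal(size=n),      # mean fundamental frequency
    "spectral_flux": rng.normal(size=n),     # a timbre measure
    "rate": rng.normal(size=n),              # timing: events per second
    "familiar": rng.integers(0, 2, size=n),  # 1 = familiar with the dùndún
})
# Simulated responses: faster, more variable samples tend to be heard as
# music, and familiarity sharpens the use of the timing cue.
logits = (0.8 * df["rate"] + 0.6 * df["intensity_var"]
          + 0.4 * df["familiar"] * df["rate"])
df["classified_music"] = rng.binomial(1, 1 / (1 + np.exp(-logits)))

model = smf.logit(
    "classified_music ~ (intensity_var + pitch_height + spectral_flux + rate)"
    " * familiar",
    data=df,
).fit()
print(model.summary())
```

The interaction terms in the formula are what let the model test whether familiar listeners weight the acoustic cues differently from unfamiliar ones.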
The Dùndún Drum helps us understand how we process speech and music
Every day, you hear many sounds in your environment, like speech, music, animal calls, or passing cars. How do you tease apart these unique categories of sounds? We aimed to understand more about how people distinguish speech and music by using an instrument that can both "speak" and play music: the dùndún talking drum. We were interested in whether people could tell if the sound produced by the drum was speech or music. People who were familiar with the dùndún were good at the task, but so were those who had never heard the dùndún, suggesting that there are general characteristics of sound that define speech and music categories. We observed that music is faster, more regular, and more variable in volume than "speech." This research helps us understand the interesting instrument that is the dùndún and provides insights about how humans distinguish two important types of sound: speech and music
Exploring emotional prototypes in a high dimensional TTS latent space
Recent TTS systems are able to generate prosodically varied and realistic speech. However, it is unclear how this prosodic variation contributes to the perception of speakers' emotional states. Here we use the recent psychological paradigm 'Gibbs Sampling with People' to search the prosodic latent space of a trained Global Style Token Tacotron model and explore prototypes of emotional prosody. Participants are recruited online and collectively manipulate the latent space of the generative speech model in a sequentially adaptive way, so that the stimulus presented to one group of participants is determined by the responses of the previous groups. We demonstrate that (1) particular regions of the model's latent space are reliably associated with particular emotions, (2) the resulting emotional prototypes are well recognized by a separate group of human raters, and (3) these emotional prototypes can be effectively transferred to new sentences. Collectively, these experiments demonstrate a novel approach to the understanding of emotional speech by providing a tool to explore the relation between the latent space of generative models and human semantics
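As an illustration of the paradigm described above, the following is a minimal sketch, not the authors' implementation, of a single GSP-style sweep over a style-embedding space followed by transfer of the resulting prototype to a new sentence; synthesize and slider_choice are hypothetical placeholders for the Global Style Token Tacotron model and the online participants.

```python
# Hedged sketch: one pass of slider-based, dimension-by-dimension search in a
# style-embedding space, plus prototype transfer. All functions are placeholders.
import numpy as np

def synthesize(style, text):
    """Placeholder: render speech conditioned on a style vector."""
    return {"style": style.copy(), "text": text}

def slider_choice(style, dim, grid, text):
    """Placeholder: a participant picks the grid value that sounds most like the
    target emotion while all other dimensions are held fixed."""
    return grid[np.argmin(np.abs(grid - 0.7))]  # dummy preference near 0.7

def gsp_sweep(style, text, grid=np.linspace(-2.0, 2.0, 41)):
    style = style.copy()
    for dim in range(style.size):
        # Each trial manipulates a single latent dimension with a slider.
        style[dim] = slider_choice(style, dim, grid, text)
    return style

prototype = gsp_sweep(np.zeros(16), "The meeting starts at noon.")
# Prototype transfer: re-synthesize an unseen sentence with the same style vector.
transferred = synthesize(prototype, "Please close the window.")
```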
Gibbs sampling with people
A core problem in cognitive science and machine learning is to understand how humans derive semantic representations from perceptual objects, such as color from an apple, pleasantness from a musical chord, or seriousness from a face. Markov Chain Monte Carlo with People (MCMCP) is a prominent method for studying such representations, in which participants are presented with binary choice trials constructed such that the decisions follow a Markov Chain Monte Carlo acceptance rule. However, while MCMCP has strong asymptotic properties, its binary choice paradigm generates relatively little information per trial, and its local proposal function makes it slow to explore the parameter space and find the modes of the distribution. Here we therefore generalize MCMCP to a continuous-sampling paradigm, where in each iteration the participant uses a slider to continuously manipulate a single stimulus dimension to optimize a given criterion such as 'pleasantness'. We formulate both methods from a utility-theory perspective, and show that the new method can be interpreted as 'Gibbs Sampling with People' (GSP). Further, we introduce an aggregation parameter to the transition step, and show that this parameter can be manipulated to flexibly shift between Gibbs sampling and deterministic optimization. In an initial study, we show GSP clearly outperforming MCMCP; we then show that GSP provides novel and interpretable results in three other domains, namely musical chords, vocal emotions, and faces. We validate these results through large-scale perceptual rating experiments. The final experiments use GSP to navigate the latent space of a state-of-the-art image synthesis network (StyleGAN), a promising approach for applying GSP to high-dimensional perceptual spaces. We conclude by discussing future cognitive applications and ethical implications
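The transition step described above can be illustrated with a minimal sketch under stated assumptions (this is not the paper's code): each trial updates one stimulus dimension from simulated slider settings, and an aggregation parameter controls how many settings are averaged per update, so a value of 1 behaves like a Gibbs-style sample while larger values approach deterministic coordinate-wise optimization.

```python
# Hedged sketch of a GSP-like chain; participant_slider is a noisy stand-in
# for a human response on a single stimulus dimension.
import numpy as np

rng = np.random.default_rng(1)

def participant_slider(stimulus, dim, grid, optimum=0.5, noise=0.4):
    """One simulated response: the grid value closest to a noisy optimum."""
    return grid[np.argmin(np.abs(grid - (optimum + rng.normal(scale=noise))))]

def chain(n_dims=4, n_iters=40, aggregation=1, grid=np.linspace(-2, 2, 81)):
    stimulus = rng.uniform(-2, 2, size=n_dims)
    for t in range(n_iters):
        dim = t % n_dims  # cycle through dimensions, one per trial
        # aggregation = 1: keep a single response (a Gibbs-style sample);
        # larger values average several responses, suppressing noise and
        # approaching deterministic coordinate-wise optimization.
        settings = [participant_slider(stimulus, dim, grid)
                    for _ in range(aggregation)]
        stimulus[dim] = float(np.mean(settings))
    return stimulus

print(chain(aggregation=1))   # noisier, more exploratory chain
print(chain(aggregation=20))  # settles near the shared optimum (0.5)
```

Running the two calls at the end illustrates the shift from a sampling regime, which retains response noise, to an optimization regime that converges on the consensus value.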
The Timbre Perception Test (TPT): A new interactive musical assessment tool to measure timbre perception ability
To date, tests that measure individual differences in the ability to perceive musical timbre are scarce in the published literature. The lack of such tools limits research on how timbre, a primary attribute of sound, is perceived and processed among individuals. The current paper describes the development of the Timbre Perception Test (TPT), in which participants use a slider to reproduce heard auditory stimuli that vary along three important dimensions of timbre: envelope, spectral flux, and spectral centroid. With a sample of 95 participants, the TPT was calibrated and validated against measures of related abilities and examined for its reliability. The results indicate that a short version (8 minutes) of the TPT has good explanatory support from a factor analysis model, acceptable internal reliability (α = .69, ωt = .70), good test-retest reliability (r = .79), and substantial correlations with self-reported general musical sophistication (ρ = .63) and pitch discrimination (ρ = .56), as well as somewhat lower correlations with duration discrimination (ρ = .27) and musical instrument discrimination abilities (ρ = .33). Overall, the TPT represents a robust tool to measure an individual's timbre perception ability. Furthermore, the use of sliders to perform a reproduction task has been shown to be an effective approach in threshold testing. The current version of the TPT is openly available for research purposes
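For reference, here is a minimal sketch of the internal-reliability statistic reported above, computed on hypothetical participant-by-item scores; this is not the TPT scoring code, and the simulated data are illustrative only.

```python
# Hedged sketch: Cronbach's alpha for an n_participants x n_items score matrix.
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_participants, n_items) array of item-level scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=(95, 1))                       # latent timbre ability
items = ability + rng.normal(scale=1.2, size=(95, 24))   # 24 hypothetical items
print(round(cronbach_alpha(items), 2))
```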
Globally, songs and instrumental melodies are slower and higher and use more stable pitches than speech: A Registered Report
Both music and language are found in all known human societies, yet no studies have compared similarities and differences between song, speech, and instrumental music on a global scale. In this Registered Report, we analyzed two global datasets: (i) 300 annotated audio recordings representing matched sets of traditional songs, recited lyrics, conversational speech, and instrumental melodies from our 75 coauthors speaking 55 languages; and (ii) 418 previously published adult-directed song and speech recordings from 209 individuals speaking 16 languages. Of our six preregistered predictions, five were strongly supported: relative to speech, songs use (i) higher pitch, (ii) slower temporal rate, and (iii) more stable pitches, while songs and speech use similar (iv) pitch interval sizes and (v) timbral brightness. Exploratory analyses suggest that these features vary along a "musi-linguistic" continuum when instrumental melodies and recited lyrics are included. Our study provides strong empirical evidence of cross-cultural regularities in music and speech
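A minimal sketch, under assumptions, of how such a song-versus-speech comparison could be set up in Python: the file paths are hypothetical, and pyin-based pitch tracking plus onset detection merely stand in for the study's annotated features.

```python
# Hedged sketch: extract pitch height and event rate from matched song/speech
# recordings and run paired, directional tests. Paths below are placeholders;
# replace `pairs` with the full set of matched recordings.
import librosa
import numpy as np
from scipy.stats import wilcoxon

def describe(path):
    y, sr = librosa.load(path, sr=None, mono=True)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    median_f0 = np.nanmedian(f0)                        # pitch height (Hz)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    rate = len(onsets) / (len(y) / sr)                  # events per second
    return median_f0, rate

pairs = [("song_01.wav", "speech_01.wav"),
         ("song_02.wav", "speech_02.wav")]              # hypothetical files
song_stats = np.array([describe(s) for s, _ in pairs])
speech_stats = np.array([describe(p) for _, p in pairs])

# Paired, non-parametric tests of the preregistered directions:
# songs higher in pitch, and slower in rate, than matched speech.
print(wilcoxon(song_stats[:, 0], speech_stats[:, 0], alternative="greater"))
print(wilcoxon(speech_stats[:, 1], song_stats[:, 1], alternative="greater"))
```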