
    Exploitation of Phase-Based Features for Whispered Speech Emotion Recognition

    Features for speech emotion recognition are usually dominated by spectral magnitude information, while the phase spectrum is ignored because of the difficulty of interpreting it properly. Motivated by recent successes of phase-based features in speech processing, this paper investigates the effectiveness of phase information for whispered speech emotion recognition. We select two types of phase-based features (i.e., modified group delay features and all-pole group delay features), both of which have shown wide applicability across speech analysis tasks and are studied here for whispered speech emotion recognition. Building on these features, we propose a new speech emotion recognition framework that employs the outer product in combination with power and L2 normalization. This technique encodes any variable-length sequence of phase-based features into a vector of fixed dimension, regardless of the length of the input sequence. The resulting representation is used to train a classifier with a linear kernel. Experimental results on the Geneva Whispered Emotion Corpus, which includes both normal and whispered phonation, demonstrate the effectiveness of the proposed method compared with other modern systems. It is also shown that combining phase information with magnitude information can significantly improve performance over common systems that rely solely on magnitude information.
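
A minimal sketch of the encoding step described above, assuming the outer product is averaged over frames and that power normalization takes the common signed form sign(z)|z|^alpha; the function name and feature dimensionality are illustrative:

```python
import numpy as np

def encode_sequence(feats, alpha=0.5):
    """Encode a variable-length (T, D) feature sequence into a fixed
    D*D-dimensional vector via time-averaged outer products, followed
    by signed power normalization and L2 normalization."""
    # Average each frame's outer product with itself over time, so the
    # output size depends only on the feature dimension D, not on T.
    G = np.einsum('td,te->de', feats, feats) / feats.shape[0]
    z = G.flatten()
    # Signed power normalization dampens dominant components.
    z = np.sign(z) * np.abs(z) ** alpha
    # L2 normalization makes encodings comparable across utterances.
    return z / (np.linalg.norm(z) + 1e-12)

# Two utterances of different length map to vectors of the same size,
# ready for a classifier with a linear kernel.
short = encode_sequence(np.random.randn(50, 40))
long_ = encode_sequence(np.random.randn(300, 40))
assert short.shape == long_.shape == (1600,)
```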

    Music and dance as a coalition signaling system

    Evidence suggests that humans have neurological specializations for music processing, but a compelling adaptationist account of music and dance is lacking. The sexual selection hypothesis cannot easily account for the widespread performance of music and dance in groups (especially synchronized performances), and the social bonding hypothesis has severe theoretical difficulties. Humans are unique among the primates in their ability to form cooperative alliances between groups in the absence of consanguineal ties. We propose that this unique form of social organization is predicated on music and dance. Music and dance may have evolved as a coalition signaling system that could, among other things, credibly communicate coalition quality, thus permitting meaningful cooperative relationships between groups. This capability may have evolved from coordinated territorial defense signals that are common in many social species, including chimpanzees. We present a study in which manipulation of music synchrony significantly altered subjects’ perceptions of music quality, and in which subjects’ perceptions of music quality were correlated with their perceptions of coalition quality, supporting our hypothesis. Our hypothesis also has implications for the evolution of psychological mechanisms underlying cultural production in other domains, such as food preparation, clothing and body decoration, storytelling and ritual, and tools and other artifacts.

    Digital neuromorphic auditory systems

    This dissertation presents several digital neuromorphic auditory systems. Neuromorphic systems are capable of running in real time at a smaller computing cost and lower power consumption than widely available general-purpose computers. These auditory systems are considered neuromorphic because they are modelled after computational models of the mammalian auditory pathway and are capable of running on digital hardware, specifically on a field-programmable gate array (FPGA). The models introduced are categorised into three parts: a cochlear model, an auditory pitch model, and a functional primary auditory cortical (A1) model. The cochlear model is the primary interface for an input sound signal and transmits the 2D time-frequency representation of the sound to the pitch model as well as to the A1 model. In the pitch model, pitch information is extracted from the sound signal in the form of a fundamental frequency. From the A1 model, timbre information is extracted in the form of the time-frequency envelope of the sound signal.

    Since these computational auditory models must be implemented on FPGAs, which possess fewer computational resources than general-purpose computers, the algorithms in the models are optimised to fit on a single FPGA. The optimisation includes the use of simplified, hardware-implementable signal processing algorithms. Computational resource information for each model on the FPGA is extracted to establish the minimum resources required to run it, including the number of logic modules, the number of registers utilised, and the power consumption. Similarity comparisons are also made between the output responses of the computational auditory models in software and hardware, using pure tones, chirp signals, frequency-modulated signals, moving ripple signals, and musical signals as input. The limitations of the models' responses to musical signals at multiple intensity levels are also presented, along with the use of an automatic gain control algorithm to alleviate them.

    With real-world musical signals as input, the responses of the models are also tested using classifiers: the response of the auditory pitch model is used for the classification of monophonic musical notes, and the response of the A1 model is used for the classification of musical instruments from their respective monophonic signals. Classification accuracy is reported for model output responses in both software and hardware. With the hardware-implementable auditory pitch model, classification accuracy is 100% for musical notes from the 4th and 5th octaves, covering 24 classes of notes. With the hardware-implementable auditory timbre model, classification accuracy is 92% for 12 classes of musical instruments. Also presented are the differences in memory requirements of the model output responses in software and hardware: the pitch and timbre responses used for the classification exercises require 24 and 2 times less memory space, respectively, in hardware than in software.
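
A software sketch of the first two stages of such a pipeline, with a bandpass filterbank standing in for the cochlear model and an autocorrelation peak standing in for the pitch model; the filter design, band edges and test tone are illustrative, not the dissertation's FPGA implementation:

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochlear_filterbank(x, fs, centers):
    """Crude cochlea-like front end: a bank of bandpass filters whose
    half-wave-rectified outputs form a 2D time-frequency representation."""
    channels = []
    for fc in centers:
        lo, hi = fc / np.sqrt(2), fc * np.sqrt(2)  # one-octave band
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        channels.append(np.maximum(lfilter(b, a, x), 0.0))
    return np.array(channels)  # shape: (n_channels, n_samples)

def pitch_from_autocorr(x, fs, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)                       # A3 test tone
tf = cochlear_filterbank(tone, fs, [125, 250, 500, 1000, 2000])
print(tf.shape, pitch_from_autocorr(tone, fs))           # (5, 16000), ~220 Hz
```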

    A review of differentiable digital signal processing for music and speech synthesis

    The term “differentiable digital signal processing” describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, further supported by a web book containing practical advice on differentiable synthesiser programming (https://intro2ddsp.github.io/). Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research.
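
A minimal sketch of the core idea, gradients backpropagated through a signal processor, here recovering an oscillator's amplitude by gradient descent in PyTorch; the oscillator, target and optimiser settings are illustrative:

```python
import torch

def oscillator(freq, amp, fs=16000, n=16000):
    """A sinusoidal oscillator built from torch ops, so gradients of a
    loss on the output audio flow back to the synthesis parameters."""
    t = torch.arange(n, dtype=torch.float32) / fs
    return amp * torch.sin(2 * torch.pi * freq * t)

target = oscillator(torch.tensor(440.0), torch.tensor(0.5))

# Recover the amplitude by gradient descent through the oscillator.
# (Optimising frequency through a raw waveform loss is a well-known
# example of the optimisation pathologies mentioned above.)
amp = torch.tensor(0.1, requires_grad=True)
opt = torch.optim.Adam([amp], lr=0.01)
for _ in range(500):
    opt.zero_grad()
    loss = torch.mean((oscillator(torch.tensor(440.0), amp) - target) ** 2)
    loss.backward()
    opt.step()
print(amp.item())  # converges towards 0.5
```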

    Predictive cognition in dementia: the case of music

    The clinical complexity and pathological diversity of neurodegenerative diseases impose immense challenges for diagnosis and the design of rational interventions. To address these challenges, there is a need to identify new paradigms and biomarkers that capture shared pathophysiological processes and can be applied across a range of diseases. One core paradigm of brain function is predictive coding: the processes by which the brain establishes predictions and uses them to minimise prediction errors, represented as the difference between predictions and actual sensory inputs. The processes involved in handling unexpected events and responding appropriately are vulnerable in common dementias but difficult to characterise. In my PhD work, I have exploited key properties of music – its universality, ecological relevance and structural regularity – to model and assess predictive cognition in patients representing major syndromes of frontotemporal dementia – non-fluent-variant PPA (nfvPPA), semantic-variant PPA (svPPA) and behavioural-variant FTD (bvFTD) – and Alzheimer's disease, relative to healthy older individuals.

    In my first experiment, I presented patients with well-known melodies containing no deviants or one of three types of deviant: acoustic (white-noise burst), syntactic (key-violating pitch change) or semantic (key-preserving pitch change). I assessed accuracy in detecting melodic deviants together with simultaneously recorded pupillary responses to these deviants. I used voxel-based morphometry to define neuroanatomical substrates for the behavioural and autonomic processing of these different types of deviants, and identified a posterior temporo-parietal network for the detection of basic acoustic deviants and a more anterior fronto-temporo-striatal network for the detection of syntactic pitch deviants.

    In my second experiment, I investigated the ability of patients to track the statistical structure of the same musical stimuli, using a computational model of the information dynamics of music to calculate the information content of deviants (unexpectedness) and the entropy of melodies (uncertainty). I related these information-theoretic metrics to performance in detecting deviants and to ‘evoked’ and ‘integrative’ pupil reactivity to deviants and melodies respectively, and found neuroanatomical correlates in bilateral dorsal and ventral striatum, hippocampus, superior temporal gyri, right temporal pole and left inferior frontal gyrus.

    Together, chapters 3 and 4 suggested new hypotheses about the way FTD and AD pathologies disrupt the integration of prediction errors with predictions: a retained ability of AD patients to detect deviants at all levels of the hierarchy, with preserved autonomic sensitivity to the information-theoretic properties of musical stimuli; a generalised impairment of surprise detection and statistical tracking of musical information at both the cognitive and autonomic levels in svPPA patients, reflecting diminished precision of predictions; the mirror-image profile in nfvPPA patients, with an abnormally high rate of false alarms and up-regulated pupillary reactivity to deviants, interpreted as over-precise or inflexible predictions accompanied by normal cognitive and autonomic probabilistic tracking of information; and impaired behavioural and autonomic reactivity to unexpected events with retained reactivity to environmental uncertainty in bvFTD patients.
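
A minimal sketch of the two information-theoretic metrics, assuming a first-order Markov model over pitches as a stand-in for the full model of melodic information dynamics; the toy corpus and function names are illustrative:

```python
import numpy as np
from collections import Counter, defaultdict

def markov_model(melodies):
    """First-order Markov model over pitches, standing in for a full
    variable-order model of melodic expectation."""
    counts = defaultdict(Counter)
    for mel in melodies:
        for prev, nxt in zip(mel, mel[1:]):
            counts[prev][nxt] += 1
    return {p: {n: c / sum(cs.values()) for n, c in cs.items()}
            for p, cs in counts.items()}

def surprise_and_uncertainty(model, melody):
    """Information content (unexpectedness) of each note and entropy
    (uncertainty) of each predictive distribution, both in bits."""
    ics, ents = [], []
    for prev, nxt in zip(melody, melody[1:]):
        dist = model.get(prev, {})
        p = dist.get(nxt, 1e-6)  # probability floor for unseen continuations
        ics.append(-np.log2(p))
        ents.append(-sum(q * np.log2(q) for q in dist.values()) if dist else 0.0)
    return ics, ents

corpus = [[60, 62, 64, 65, 64, 62, 60], [60, 64, 67, 64, 60]]  # MIDI pitches
model = markov_model(corpus)
ic, ent = surprise_and_uncertainty(model, [60, 62, 64, 62, 60])
```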
Chapters 5 and 6 assessed the status of reward prediction error processing, and its updating via actions, in bvFTD. I created pleasant and aversive musical stimuli by manipulating chord progressions and used a classic reinforcement-learning paradigm in which participants chose the visual cue with the highest probability of obtaining a musical ‘reward’. bvFTD patients showed reduced sensitivity to the consequences of an action and a lower learning rate in response to aversive stimuli than to reward. These results correlated with neuroanatomical substrates in the ventral and dorsal attention networks, dorsal striatum, parahippocampal gyrus and temporo-parietal junction. Deficits were governed by the level of environmental uncertainty, with normal learning dynamics in a structured, binarised environment but exacerbated deficits in noisier environments. Impaired choice accuracy in noisy environments correlated with measures of ritualistic and compulsive behavioural change, and abnormally reduced learning dynamics correlated with behavioural changes related to empathy and theory of mind. Together, these experiments represent the most comprehensive attempt to date to define, in predictive coding terms, the way neurodegenerative pathologies disrupt the perceptual, behavioural and physiological encoding of unexpected events.
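
A minimal sketch of the kind of asymmetric learning model such results point to: a delta rule with separate learning rates for better- and worse-than-expected outcomes. The parameter values and task probabilities are illustrative, not the study's fitted model:

```python
import numpy as np

def rescorla_wagner(choices, outcomes, alpha_pos=0.3, alpha_neg=0.1, n_options=2):
    """Delta-rule value learning with separate learning rates for positive
    and negative reward prediction errors (values are illustrative)."""
    V = np.zeros(n_options)
    for choice, outcome in zip(choices, outcomes):
        delta = outcome - V[choice]                    # reward prediction error
        alpha = alpha_pos if delta >= 0 else alpha_neg
        V[choice] += alpha * delta                     # update the chosen cue only
    return V

# One cue yields a musical 'reward' 80% of the time, the other 20%.
rng = np.random.default_rng(0)
choices = rng.integers(0, 2, size=200)
outcomes = np.where(choices == 0,
                    rng.random(200) < 0.8,
                    rng.random(200) < 0.2).astype(float)
print(rescorla_wagner(choices, outcomes))  # higher value for the 80% cue
```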

    Singing voice analysis/synthesis

    Thesis (Ph.D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003. Includes bibliographical references (p. 109-115). By Youngmoo Edmund Kim.
    The singing voice is the oldest and most variable of musical instruments. By combining music, lyrics, and expression, the voice is able to affect us in ways that no other instrument can. As listeners, we are innately drawn to the sound of the human voice, and when present it is almost always the focal point of a musical piece. But the acoustic flexibility of the voice in intimating words, shaping phrases, and conveying emotion also makes it the most difficult instrument to model computationally. Moreover, while all voices are capable of producing the common sounds necessary for language understanding and communication, each voice possesses distinctive features independent of phonemes and words. These unique acoustic qualities are the result of a combination of innate physical factors and expressive characteristics of performance, reflecting an individual's vocal identity. A great deal of prior research has focused on speech recognition and speaker identification, but relatively little work has been performed specifically on singing. There are significant differences between speech and singing in terms of both production and perception. Traditional computational models of speech have focused on the intelligibility of language, often sacrificing sound quality for model simplicity. Such models, however, are detrimental to the goal of singing, which relies on acoustic authenticity for the non-linguistic communication of expression and emotion. These differences between speech and singing dictate that a different and specialized representation is needed to capture the sound quality and musicality most valued in singing. This dissertation proposes an analysis/synthesis framework specifically for the singing voice that models the time-varying physical and expressive characteristics unique to an individual voice. The system operates by jointly estimating source-filter voice model parameters, representing vocal physiology, and modeling the dynamic behavior of these features over time to represent aspects of expression. This framework is demonstrated to be useful for several applications, such as singing voice coding, automatic singer identification, and voice transformation.
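
A minimal sketch of source-filter analysis/synthesis by linear prediction, with the all-pole filter standing in for the vocal tract and the residual for the source; the synthetic 'voice', filter order and frame length are illustrative:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order=12):
    """All-pole (source-filter) analysis by linear prediction: the filter
    approximates the vocal tract, the residual approximates the source."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])  # autocorrelation normal equations
    return np.concatenate(([1.0], -a))             # inverse-filter coefficients A(z)

fs, n = 16000, 4000
# Stand-in 'voice': a ~150 Hz pulse train driving a fixed resonance.
source = (np.arange(n) % (fs // 150) == 0).astype(float)
resonance = [1.0, -1.8 * np.cos(2 * np.pi * 500 / fs), 0.81]
frame = lfilter([1.0], resonance, source)

A = lpc(frame, order=8)
residual = lfilter(A, [1.0], frame)    # analysis: inverse filter -> source estimate
resynth = lfilter([1.0], A, residual)  # synthesis: reconstruct the frame exactly
assert np.allclose(resynth, frame)
```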

    Ultrasound cleaning of microfilters
